Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Patrick Esser*, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, Robin Rombach*
Stability AI
* Equal contribution. <firstlast>@stability.ai

Figure 1. High-resolution samples from our 8B rectified flow model, showcasing its capabilities in typography, precise prompt following and spatial reasoning, attention to fine details, and high image quality across a wide variety of styles.

Abstract

Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations. Our largest models outperform state-of-the-art models, and we will make our experimental data, code, and model weights publicly available.

1. Introduction

Diffusion models create data from noise (Song et al., 2020). They are trained to invert forward paths of data towards random noise and, thus, in conjunction with approximation and generalization properties of neural networks, can be used to generate new data points that are not present in the training data but follow the distribution of the training data (Sohl-Dickstein et al., 2015; Song & Ermon, 2020). This generative modeling technique has proven to be very effective for modeling high-dimensional, perceptual data such as images (Ho et al., 2020). In recent years, diffusion models have become the de-facto approach for generating high-resolution images and videos from natural language inputs with impressive generalization capabilities (Saharia et al., 2022b; Ramesh et al., 2022; Rombach et al., 2022; Podell et al., 2023; Dai et al., 2023; Esser et al., 2023; Blattmann et al., 2023b; Betker et al., 2023; Blattmann et al., 2023a; Singer et al., 2022). Due to their iterative nature and the associated computational costs, as well as the long sampling times during inference, research on formulations for more efficient training and/or faster sampling of these models has increased (Karras et al., 2023; Liu et al., 2022).

While specifying a forward path from data to noise leads to efficient training, it also raises the question of which path to choose. This choice can have important implications for sampling. For example, a forward process that fails to remove all noise from the data can lead to a discrepancy between training and test distribution and result in artifacts such as gray image samples (Lin et al., 2024). Importantly, the choice of the forward process also influences the learned backward process and, thus, the sampling efficiency. While curved paths require many integration steps to simulate the process, a straight path could be simulated with a single step and is less prone to error accumulation. Since each step corresponds to an evaluation of the neural network, this has a direct impact on the sampling speed.

A particular choice for the forward path is a so-called Rectified Flow (Liu et al., 2022; Albergo & Vanden-Eijnden, 2022; Lipman et al., 2023), which connects data and noise on a straight line. Although this model class has better theoretical properties, it has not yet become decisively established in practice. So far, some advantages have been empirically demonstrated in small and medium-sized experiments (Ma et al., 2024), but these are mostly limited to class-conditional models. In this work, we change this by introducing a re-weighting of the noise scales in rectified flow models, similar to noise-predictive diffusion models (Ho et al., 2020). Through a large-scale study, we compare our new formulation to existing diffusion formulations and demonstrate its benefits.

We show that the widely used approach for text-to-image synthesis, where a fixed text representation is fed directly into the model (e.g., via cross-attention (Vaswani et al., 2017; Rombach et al., 2022)), is not ideal, and present a new architecture that incorporates learnable streams for both image and text tokens, which enables a two-way flow of information between them. We combine this with our improved rectified flow formulation and investigate its scalability. We demonstrate a predictable scaling trend in the validation loss and show that a lower validation loss correlates strongly with improved automatic and human evaluations. Our largest models outperform state-of-the-art open models such as SDXL (Podell et al., 2023), SDXL-Turbo (Sauer et al., 2023), Pixart-α (Chen et al., 2023), and closed-source models such as DALL-E 3 (Betker et al., 2023) both in quantitative evaluation (Ghosh et al., 2023) of prompt understanding and in human preference ratings.

The core contributions of our work are: (i) We conduct a large-scale, systematic study on different diffusion model and rectified flow formulations to identify the best setting. For this purpose, we introduce new noise samplers for rectified flow models that improve performance over previously known samplers. (ii) We devise a novel, scalable architecture for text-to-image synthesis that allows bi-directional mixing between text and image token streams within the network. We show its benefits compared to established backbones such as UViT (Hoogeboom et al., 2023) and DiT (Peebles & Xie, 2023). Finally, we (iii) perform a scaling study of our model and demonstrate that it follows predictable scaling trends. We show that a lower validation loss correlates strongly with improved text-to-image performance assessed via metrics such as T2I-CompBench (Huang et al., 2023), GenEval (Ghosh et al., 2023) and human ratings. We make results, code, and model weights publicly available.
2. Simulation-Free Training of Flows

We consider generative models that define a mapping between samples $x_1$ from a noise distribution $p_1$ to samples $x_0$ from a data distribution $p_0$ in terms of an ordinary differential equation (ODE),

    $dy_t = v_\Theta(y_t, t)\, dt,$   (1)

where the velocity $v$ is parameterized by the weights $\Theta$ of a neural network. Prior work by Chen et al. (2018) suggested to directly solve Equation (1) via differentiable ODE solvers. However, this process is computationally expensive, especially for large network architectures that parameterize $v_\Theta(y_t, t)$. A more efficient alternative is to directly regress a vector field $u_t$ that generates a probability path between $p_0$ and $p_1$. To construct such a $u_t$, we define a forward process, corresponding to a probability path $p_t$ between $p_0$ and $p_1 = \mathcal{N}(0, I)$, as

    $z_t = a_t x_0 + b_t \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, I).$   (2)

For $a_0 = 1$, $b_0 = 0$, $a_1 = 0$ and $b_1 = 1$, the marginals

    $p_t(z_t) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\, p_t(z_t \mid \epsilon)$   (3)

are consistent with the data and noise distribution.

To express the relationship between $z_t$, $x_0$ and $\epsilon$, we introduce $\psi_t$ and $u_t$ as

    $\psi_t(\cdot \mid \epsilon): x_0 \mapsto a_t x_0 + b_t \epsilon,$   (4)
    $u_t(z \mid \epsilon) := \psi_t'\left(\psi_t^{-1}(z \mid \epsilon) \mid \epsilon\right).$   (5)

Since $z_t$ can be written as a solution to the ODE $z_t' = u_t(z_t \mid \epsilon)$ with initial value $z_0 = x_0$, $u_t(\cdot \mid \epsilon)$ generates $p_t(\cdot \mid \epsilon)$. Remarkably, one can construct a marginal vector field $u_t$ which generates the marginal probability paths $p_t$ (Lipman et al., 2023) (see B.1), using the conditional vector fields $u_t(\cdot \mid \epsilon)$:

    $u_t(z) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[ u_t(z \mid \epsilon)\, \frac{p_t(z \mid \epsilon)}{p_t(z)} \right].$   (6)

While regressing $u_t$ with the Flow Matching objective

    $\mathcal{L}_{FM} = \mathbb{E}_{t, p_t(z)} \lVert v_\Theta(z, t) - u_t(z) \rVert_2^2$   (7)

directly is intractable due to the marginalization in Equation (6), Conditional Flow Matching (see B.1),

    $\mathcal{L}_{CFM} = \mathbb{E}_{t, p_t(z \mid \epsilon), p(\epsilon)} \lVert v_\Theta(z, t) - u_t(z \mid \epsilon) \rVert_2^2,$   (8)

with the conditional vector fields $u_t(z \mid \epsilon)$, provides an equivalent yet tractable objective.

To convert the loss into an explicit form, we insert $\psi_t(x_0 \mid \epsilon) = a_t x_0 + b_t \epsilon$ and $\psi_t'(x_0 \mid \epsilon) = a_t' x_0 + b_t' \epsilon$ into (5):

    $z_t' = u_t(z_t \mid \epsilon) = \frac{a_t'}{a_t} z_t - \epsilon\, b_t \left( \frac{a_t'}{a_t} - \frac{b_t'}{b_t} \right).$   (9)

Now, consider the signal-to-noise ratio $\lambda_t := \log \frac{a_t^2}{b_t^2}$. With $\lambda_t' = 2\left( \frac{a_t'}{a_t} - \frac{b_t'}{b_t} \right)$, we can rewrite Equation (9) as

    $u_t(z_t \mid \epsilon) = \frac{a_t'}{a_t} z_t - \frac{b_t}{2} \lambda_t'\, \epsilon.$   (10)

Next, we use Equation (10) to reparameterize Equation (8) as a noise-prediction objective:

    $\mathcal{L}_{CFM} = \mathbb{E}_{t, p_t(z \mid \epsilon), p(\epsilon)} \left\lVert v_\Theta(z, t) - \frac{a_t'}{a_t} z + \frac{b_t}{2} \lambda_t'\, \epsilon \right\rVert_2^2$   (11)
    $\qquad\;\;\, = \mathbb{E}_{t, p_t(z \mid \epsilon), p(\epsilon)} \left( -\frac{b_t}{2} \lambda_t' \right)^2 \left\lVert \epsilon_\Theta(z, t) - \epsilon \right\rVert_2^2,$   (12)

where we defined $\epsilon_\Theta := \frac{-2}{\lambda_t' b_t}\left( v_\Theta - \frac{a_t'}{a_t} z \right)$.

Note that the optimum of the above objective does not change when introducing a time-dependent weighting. Thus, one can derive various weighted loss functions that provide a signal towards the desired solution but might affect the optimization trajectory. For a unified analysis of different approaches, including classic diffusion formulations, we can write the objective in the following form (following Kingma & Gao (2023)):

    $\mathcal{L}_w(x_0) = -\frac{1}{2} \mathbb{E}_{t \sim \mathcal{U}(t),\, \epsilon \sim \mathcal{N}(0, I)} \left[ w_t \lambda_t' \lVert \epsilon_\Theta(z_t, t) - \epsilon \rVert^2 \right],$

where $w_t = -\frac{1}{2} \lambda_t' b_t^2$ corresponds to $\mathcal{L}_{CFM}$.
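To make this objective concrete, the following is a minimal PyTorch sketch of the Conditional Flow Matching loss of Equation (8), instantiated for the straight-path choice $a_t = 1 - t$, $b_t = t$ discussed in the next section. It is an illustration under our reading of the formulas, not the authors' released code; `model` is assumed to be any network mapping `(z_t, t)` to a velocity of the same shape as `z_t`, and `t` is assumed to be a 1-D batch of timesteps.

```python
import torch

def cfm_loss(model, x0, t):
    """Conditional Flow Matching loss (Eq. 8) for the straight-path process
    z_t = (1 - t) * x0 + t * eps, i.e. a_t = 1 - t, b_t = t. The conditional
    vector field u_t(z | eps) = a_t' x0 + b_t' eps then reduces to eps - x0."""
    eps = torch.randn_like(x0)                   # eps ~ N(0, I)
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))     # broadcast t over data dims
    z_t = (1.0 - t_) * x0 + t_ * eps             # forward process, Eq. (2)
    target = eps - x0                            # conditional velocity target
    v_pred = model(z_t, t)                       # v_theta(z_t, t)
    return ((v_pred - target) ** 2).mean()       # squared error of Eq. (8)
```

With $t$ drawn uniformly this is the plain rectified flow objective; the samplers of Section 3.1 only change how $t$ is drawn, which by Equation (18) is equivalent to re-weighting this loss.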
3. Flow Trajectories

In this work, we consider different variants of the above formalism that we briefly describe in the following.

Rectified Flow. Rectified Flows (RFs) (Liu et al., 2022; Albergo & Vanden-Eijnden, 2022; Lipman et al., 2023) define the forward process as straight paths between the data distribution and a standard normal distribution, i.e.

    $z_t = (1 - t) x_0 + t \epsilon,$   (13)

and use $\mathcal{L}_{CFM}$, which then corresponds to $\mathcal{L}_w$ with $w_t^{RF} = \frac{t}{1 - t}$. The network output directly parameterizes the velocity $v_\Theta$.

EDM. EDM (Karras et al., 2022) uses a forward process of the form

    $z_t = x_0 + b_t \epsilon,$   (14)

where (Kingma & Gao, 2023) $b_t = \exp\!\left( F_{\mathcal{N}}^{-1}(t \mid P_m, P_s^2) \right)$, with $F_{\mathcal{N}}^{-1}$ being the quantile function of the normal distribution with mean $P_m$ and variance $P_s^2$. Note that this choice results in

    $\lambda_t \sim \mathcal{N}(-2 P_m, (2 P_s)^2) \quad \text{for } t \sim \mathcal{U}(0, 1).$   (15)

The network is parameterized through an F-prediction (Kingma & Gao, 2023; Karras et al., 2022) and the loss can be written as $\mathcal{L}_w$ with

    $w_t^{EDM} = \mathcal{N}(\lambda_t \mid -2 P_m, (2 P_s)^2) \left( e^{-\lambda_t} + 0.5^2 \right).$   (16)

Cosine. (Nichol & Dhariwal, 2021) proposed a forward process of the form

    $z_t = \cos\!\left( \tfrac{\pi}{2} t \right) x_0 + \sin\!\left( \tfrac{\pi}{2} t \right) \epsilon.$   (17)

In combination with an $\epsilon$-parameterization and loss, this corresponds to a weighting $w_t = \mathrm{sech}(\lambda_t / 2)$. When combined with a v-prediction loss (Kingma & Gao, 2023), the weighting is given by $w_t = e^{-\lambda_t / 2}$.

(LDM-)Linear. LDM (Rombach et al., 2022) uses a modification of the DDPM schedule (Ho et al., 2020). Both are variance-preserving schedules, i.e. $b_t = \sqrt{1 - a_t^2}$, and define $a_t$ for discrete timesteps $t = 0, \dots, T-1$ in terms of diffusion coefficients $\beta_t$ as $a_t = \left( \prod_{s=0}^{t} (1 - \beta_s) \right)^{1/2}$. For given boundary values $\beta_0$ and $\beta_{T-1}$, DDPM uses $\beta_t = \beta_0 + \frac{t}{T-1} (\beta_{T-1} - \beta_0)$ and LDM uses $\beta_t = \left( \sqrt{\beta_0} + \frac{t}{T-1} \left( \sqrt{\beta_{T-1}} - \sqrt{\beta_0} \right) \right)^2$.

3.1. Tailored SNR Samplers for RF Models

The RF loss trains the velocity $v_\Theta$ uniformly on all timesteps in $[0, 1]$. Intuitively, however, the resulting velocity prediction target $\epsilon - x_0$ is more difficult for $t$ in the middle of $[0, 1]$, since for $t = 0$ the optimal prediction is the mean of $p_1$, and for $t = 1$ the optimal prediction is the mean of $p_0$. In general, changing the distribution over $t$ from the commonly used uniform distribution $\mathcal{U}(t)$ to a distribution with density $\pi(t)$ is equivalent to a weighted loss $\mathcal{L}_{w_t^\pi}$ with

    $w_t^\pi = \frac{t}{1 - t}\, \pi(t).$   (18)

Thus, we aim to give more weight to intermediate timesteps by sampling them more frequently. Next, we describe the timestep densities $\pi(t)$ that we use to train our models.

Logit-Normal Sampling. One option for a distribution that puts more weight on intermediate steps is the logit-normal distribution (Atchison & Shen, 1980). Its density,

    $\pi_{\mathrm{ln}}(t; m, s) = \frac{1}{s \sqrt{2\pi}}\, \frac{1}{t (1 - t)}\, \exp\!\left( - \frac{(\mathrm{logit}(t) - m)^2}{2 s^2} \right),$   (19)

where $\mathrm{logit}(t) = \log \frac{t}{1 - t}$, has a location parameter $m$ and a scale parameter $s$. The location parameter enables us to bias the training timesteps towards either data $p_0$ (negative $m$) or noise $p_1$ (positive $m$). As shown in Figure 11, the scale parameter $s$ controls how wide the distribution is. In practice, we sample the random variable $u$ from a normal distribution $u \sim \mathcal{N}(m, s)$ and map it through the standard logistic function.

Mode Sampling with Heavy Tails. The logit-normal density always vanishes at the endpoints 0 and 1. To study whether this has adverse effects on the performance, we also use a timestep sampling distribution with strictly positive density on $[0, 1]$. For a scale parameter $s$, we define

    $f_{\mathrm{mode}}(u; s) = 1 - u - s \cdot \left( \cos^2\!\left( \tfrac{\pi}{2} u \right) - 1 + u \right).$   (20)

For $-1 \le s \le \frac{2}{\pi - 2}$, this function is monotonic, and we can use it to sample from the implied density $\pi_{\mathrm{mode}}(t; s) = \left| \frac{d}{dt} f_{\mathrm{mode}}^{-1}(t) \right|$. As seen in Figure 11, the scale parameter controls the degree to which either the midpoint (positive $s$) or the endpoints (negative $s$) are favored during sampling. This formulation also includes a uniform weighting $\pi_{\mathrm{mode}}(t; s = 0) = \mathcal{U}(t)$ for $s = 0$, which has been used widely in previous works on Rectified Flows (Liu et al., 2022; Ma et al., 2024).

CosMap. Finally, we also consider the cosine schedule (Nichol & Dhariwal, 2021) from Section 3 in the RF setting. In particular, we are looking for a mapping $f: u \mapsto f(u) = t$, $u \in [0, 1]$, such that the log-snr matches that of the cosine schedule: $2 \log \frac{\cos(\frac{\pi}{2} u)}{\sin(\frac{\pi}{2} u)} = 2 \log \frac{1 - f(u)}{f(u)}$. Solving for $f$, we obtain for $u \sim \mathcal{U}(u)$

    $t = f(u) = 1 - \frac{1}{\tan\!\left( \frac{\pi}{2} u \right) + 1},$   (21)

from which we obtain the density

    $\pi_{\mathrm{CosMap}}(t) = \left| \frac{d}{dt} f^{-1}(t) \right| = \frac{2}{\pi - 2\pi t + 2\pi t^2}.$   (22)
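As a concrete reference for the three densities above, here is a minimal PyTorch sketch of the corresponding timestep samplers. The function names and signatures are illustrative rather than taken from the paper's code, and `m` and `s` are the free location and scale parameters discussed in the text.

```python
import math
import torch

def sample_t_logit_normal(n, m=0.0, s=1.0):
    """Logit-normal sampling (Eq. 19): draw u ~ N(m, s) and map it through
    the standard logistic function, concentrating t on intermediate steps."""
    u = m + s * torch.randn(n)
    return torch.sigmoid(u)

def sample_t_mode(n, s=1.0):
    """Mode sampling with heavy tails: push u ~ U(0, 1) through f_mode of
    Eq. (20); monotonic for -1 <= s <= 2/(pi - 2), and s = 0 recovers U(0, 1)."""
    u = torch.rand(n)
    return 1.0 - u - s * (torch.cos(math.pi * u / 2.0) ** 2 - 1.0 + u)

def sample_t_cosmap(n):
    """CosMap (Eq. 21): t = 1 - 1 / (tan(pi*u/2) + 1) for u ~ U(0, 1), so that
    the log-SNR matches the cosine schedule in the RF setting."""
    u = torch.rand(n)
    return 1.0 - 1.0 / (torch.tan(math.pi * u / 2.0) + 1.0)
```

Drawing $t$ from any of these densities while keeping the loss of Equation (8) unchanged is, by Equation (18), equivalent to training with the weighted loss $\mathcal{L}_{w_t^\pi}$.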
4. Text-to-Image Architecture

For text-conditional sampling of images, our model has to take both modalities, text and images, into account. We use pretrained models to derive suitable representations and then describe the architecture of our diffusion backbone. An overview of this is presented in Figure 2.

Our general setup follows LDM (Rombach et al., 2022) for training text-to-image models in the latent space of a pretrained autoencoder. Similar to the encoding of images to latent representations, we also follow previous approaches (Saharia et al., 2022b; Balaji et al., 2022) and encode the text conditioning $c$ using pretrained, frozen text models. Details can be found in Appendix B.2.

Multimodal Diffusion Backbone. Our architecture builds upon the DiT (Peebles & Xie, 2023) architecture. DiT only considers class-conditional image generation and uses a modulation mechanism to condition the network on both the timestep of the diffusion process and the class label. Similarly, we use embeddings of the timestep $t$ and $c_{\mathrm{vec}}$ as inputs to the modulation mechanism. However, as the pooled text representation retains only coarse-grained information about the text input (Podell et al., 2023), the network also requires information from the sequence representation $c_{\mathrm{ctxt}}$.

We construct a sequence consisting of embeddings of the text and image inputs. Specifically, we add positional encodings and flatten 2×2 patches of the latent pixel representation $x \in \mathbb{R}^{h \times w \times c}$ to a patch encoding sequence of length $\frac{1}{2} h \cdot \frac{1}{2} w$. After embedding this patch encoding and the text encoding $c_{\mathrm{ctxt}}$ to a common dimensionality, we concatenate the two sequences. We then follow DiT and apply a sequence of modulated attention and MLPs.

Figure 2. Our model architecture: (a) overview of all components; (b) one MM-DiT block. The RMS-Norm for Q and K can be added to stabilize training runs.

Since text and image embeddings are conceptually quite different, we use two separate sets of weights for the two modalities. As shown in Figure 2b, this is equivalent to having two independent transformers for each modality, but joining the sequences of the two modalities for the attention operation, such that both representations can work in their own space yet take the other one into account. For our scaling experiments, we parameterize the size of the model in terms of the model's depth $d$, i.e. the number of attention blocks, by setting the hidden size to $64 \cdot d$ (expanded to $4 \cdot 64 \cdot d$ channels in the MLP blocks), and the number of attention heads equal to $d$.

5. Experiments

5.1. Improving Rectified Flows

We aim to understand which of the approaches for simulation-free training of normalizing flows as in Equation (1) is the most efficient. To enable comparisons across different approaches, we control for the optimization algorithm, the model architecture, the dataset and samplers. In addition, the losses of different approaches are incomparable and also do not necessarily correlate with the quality of output samples; hence we need evaluation metrics that allow for a comparison between approaches. We train models on ImageNet (Russakovsky et al., 2014) and CC12M (Changpinyo et al., 2021), and evaluate both the training and the EMA weights of the models during training using validation losses, CLIP scores (Radford et al., 2021; Hessel et al., 2021), and FID (Heusel et al., 2017) under different sampler settings (different guidance scales and sampling steps). We calc
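As an illustration of the joint attention described in Section 4, the following is a minimal PyTorch sketch of the core of one MM-DiT block: separate projection weights per modality, but a single attention operation over the concatenated token sequences. It omits the timestep/pooled-text modulation, the optional RMS-Norm on Q and K, and the MLPs; the class and argument names are illustrative and not taken from the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAttention(nn.Module):
    """Text and image tokens get separate QKV and output projections, but
    attention runs over the joined sequence so each modality attends to the
    other while keeping its own representation space."""
    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.qkv_img = nn.Linear(dim, 3 * dim)   # image-stream weights
        self.qkv_txt = nn.Linear(dim, 3 * dim)   # text-stream weights
        self.proj_img = nn.Linear(dim, dim)
        self.proj_txt = nn.Linear(dim, dim)

    def forward(self, img, txt):
        # img: (B, N_img, dim), txt: (B, N_txt, dim)
        B, n_img, dim = img.shape
        n_txt = txt.shape[1]
        head_dim = dim // self.num_heads

        def split_heads(x):
            return x.view(B, -1, self.num_heads, head_dim).transpose(1, 2)

        q_i, k_i, v_i = self.qkv_img(img).chunk(3, dim=-1)
        q_t, k_t, v_t = self.qkv_txt(txt).chunk(3, dim=-1)

        # Join the two sequences (text first, then image) for one attention op.
        q = split_heads(torch.cat([q_t, q_i], dim=1))
        k = split_heads(torch.cat([k_t, k_i], dim=1))
        v = split_heads(torch.cat([v_t, v_i], dim=1))

        out = F.scaled_dot_product_attention(q, k, v)            # (B, H, N, head_dim)
        out = out.transpose(1, 2).reshape(B, n_txt + n_img, dim)

        txt_out, img_out = out[:, :n_txt], out[:, n_txt:]
        return self.proj_img(img_out), self.proj_txt(txt_out)
```

Following the scaling rule stated in Section 4, a depth-$d$ model would use something like `JointAttention(dim=64 * d, num_heads=d)` inside each of its $d$ blocks.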
