stb_image_resize2.h (451105B)
1 /* stb_image_resize2 - v2.12 - public domain image resizing 2 3 by Jeff Roberts (v2) and Jorge L Rodriguez 4 http://github.com/nothings/stb 5 6 Can be threaded with the extended API. SSE2, AVX, Neon and WASM SIMD support. Only 7 scaling and translation is supported, no rotations or shears. 8 9 COMPILING & LINKING 10 In one C/C++ file that #includes this file, do this: 11 #define STB_IMAGE_RESIZE_IMPLEMENTATION 12 before the #include. That will create the implementation in that file. 13 14 EASY API CALLS: 15 Easy API downsamples w/Mitchell filter, upsamples w/cubic interpolation, clamps to edge. 16 17 stbir_resize_uint8_srgb( input_pixels, input_w, input_h, input_stride_in_bytes, 18 output_pixels, output_w, output_h, output_stride_in_bytes, 19 pixel_layout_enum ) 20 21 stbir_resize_uint8_linear( input_pixels, input_w, input_h, input_stride_in_bytes, 22 output_pixels, output_w, output_h, output_stride_in_bytes, 23 pixel_layout_enum ) 24 25 stbir_resize_float_linear( input_pixels, input_w, input_h, input_stride_in_bytes, 26 output_pixels, output_w, output_h, output_stride_in_bytes, 27 pixel_layout_enum ) 28 29 If you pass NULL or zero for the output_pixels, we will allocate the output buffer 30 for you and return it from the function (free with free() or STBIR_FREE). 31 As a special case, XX_stride_in_bytes of 0 means packed continuously in memory. 32 33 API LEVELS 34 There are three levels of API - easy-to-use, medium-complexity and extended-complexity. 35 36 See the "header file" section of the source for API documentation. 37 38 ADDITIONAL DOCUMENTATION 39 40 MEMORY ALLOCATION 41 By default, we use malloc and free for memory allocation. To override the 42 memory allocation, before the implementation #include, add a: 43 44 #define STBIR_MALLOC(size,user_data) ... 45 #define STBIR_FREE(ptr,user_data) ... 46 47 Each resize makes exactly one call to malloc/free (unless you use the 48 extended API where you can do one allocation for many resizes). Under 49 address sanitizer, we do separate allocations to find overread/writes. 50 51 PERFORMANCE 52 This library was written with an emphasis on performance. When testing 53 stb_image_resize with RGBA, the fastest mode is STBIR_4CHANNEL with 54 STBIR_TYPE_UINT8 pixels and CLAMPed edges (which is what many other resize 55 libs do by default). Also, make sure SIMD is turned on of course (default 56 for 64-bit targets). Avoid WRAP edge mode if you want the fastest speed. 57 58 This library also comes with profiling built-in. If you define STBIR_PROFILE, 59 you can use the advanced API and get low-level profiling information by 60 calling stbir_resize_extended_profile_info() or stbir_resize_split_profile_info() 61 after a resize. 62 63 SIMD 64 Most of the routines have optimized SSE2, AVX, NEON and WASM versions. 65 66 On Microsoft compilers, we automatically turn on SIMD for 64-bit x64 and 67 ARM; for 32-bit x86 and ARM, you select SIMD mode by defining STBIR_SSE2 or 68 STBIR_NEON. For AVX and AVX2, we auto-select it by detecting the /arch:AVX 69 or /arch:AVX2 switches. You can also always manually turn SSE2, AVX or AVX2 70 support on by defining STBIR_SSE2, STBIR_AVX or STBIR_AVX2. 71 72 On Linux, SSE2 and Neon is on by default for 64-bit x64 or ARM64. For 32-bit, 73 we select x86 SIMD mode by whether you have -msse2, -mavx or -mavx2 enabled 74 on the command line. For 32-bit ARM, you must pass -mfpu=neon-vfpv4 for both 75 clang and GCC, but GCC also requires an additional -mfp16-format=ieee to 76 automatically enable NEON. 77 78 On x86 platforms, you can also define STBIR_FP16C to turn on FP16C instructions 79 for converting back and forth to half-floats. This is autoselected when we 80 are using AVX2. Clang and GCC also require the -mf16c switch. ARM always uses 81 the built-in half float hardware NEON instructions. 82 83 You can also tell us to use multiply-add instructions with STBIR_USE_FMA. 84 Because x86 doesn't always have fma, we turn it off by default to maintain 85 determinism across all platforms. If you don't care about non-FMA determinism 86 and are willing to restrict yourself to more recent x86 CPUs (around the AVX 87 timeframe), then fma will give you around a 15% speedup. 88 89 You can force off SIMD in all cases by defining STBIR_NO_SIMD. You can turn 90 off AVX or AVX2 specifically with STBIR_NO_AVX or STBIR_NO_AVX2. AVX is 10% 91 to 40% faster, and AVX2 is generally another 12%. 92 93 ALPHA CHANNEL 94 Most of the resizing functions provide the ability to control how the alpha 95 channel of an image is processed. 96 97 When alpha represents transparency, it is important that when combining 98 colors with filtering, the pixels should not be treated equally; they 99 should use a weighted average based on their alpha values. For example, 100 if a pixel is 1% opaque bright green and another pixel is 99% opaque 101 black and you average them, the average will be 50% opaque, but the 102 unweighted average and will be a middling green color, while the weighted 103 average will be nearly black. This means the unweighted version introduced 104 green energy that didn't exist in the source image. 105 106 (If you want to know why this makes sense, you can work out the math for 107 the following: consider what happens if you alpha composite a source image 108 over a fixed color and then average the output, vs. if you average the 109 source image pixels and then composite that over the same fixed color. 110 Only the weighted average produces the same result as the ground truth 111 composite-then-average result.) 112 113 Therefore, it is in general best to "alpha weight" the pixels when applying 114 filters to them. This essentially means multiplying the colors by the alpha 115 values before combining them, and then dividing by the alpha value at the 116 end. 117 118 The computer graphics industry introduced a technique called "premultiplied 119 alpha" or "associated alpha" in which image colors are stored in image files 120 already multiplied by their alpha. This saves some math when compositing, 121 and also avoids the need to divide by the alpha at the end (which is quite 122 inefficient). However, while premultiplied alpha is common in the movie CGI 123 industry, it is not commonplace in other industries like videogames, and most 124 consumer file formats are generally expected to contain not-premultiplied 125 colors. For example, Photoshop saves PNG files "unpremultiplied", and web 126 browsers like Chrome and Firefox expect PNG images to be unpremultiplied. 127 128 Note that there are three possibilities that might describe your image 129 and resize expectation: 130 131 1. images are not premultiplied, alpha weighting is desired 132 2. images are not premultiplied, alpha weighting is not desired 133 3. images are premultiplied 134 135 Both case #2 and case #3 require the exact same math: no alpha weighting 136 should be applied or removed. Only case 1 requires extra math operations; 137 the other two cases can be handled identically. 138 139 stb_image_resize expects case #1 by default, applying alpha weighting to 140 images, expecting the input images to be unpremultiplied. This is what the 141 COLOR+ALPHA buffer types tell the resizer to do. 142 143 When you use the pixel layouts STBIR_RGBA, STBIR_BGRA, STBIR_ARGB, 144 STBIR_ABGR, STBIR_RX, or STBIR_XR you are telling us that the pixels are 145 non-premultiplied. In these cases, the resizer will alpha weight the colors 146 (effectively creating the premultiplied image), do the filtering, and then 147 convert back to non-premult on exit. 148 149 When you use the pixel layouts STBIR_RGBA_PM, STBIR_RGBA_PM, STBIR_RGBA_PM, 150 STBIR_RGBA_PM, STBIR_RX_PM or STBIR_XR_PM, you are telling that the pixels 151 ARE premultiplied. In this case, the resizer doesn't have to do the 152 premultipling - it can filter directly on the input. This about twice as 153 fast as the non-premultiplied case, so it's the right option if your data is 154 already setup correctly. 155 156 When you use the pixel layout STBIR_4CHANNEL or STBIR_2CHANNEL, you are 157 telling us that there is no channel that represents transparency; it may be 158 RGB and some unrelated fourth channel that has been stored in the alpha 159 channel, but it is actually not alpha. No special processing will be 160 performed. 161 162 The difference between the generic 4 or 2 channel layouts, and the 163 specialized _PM versions is with the _PM versions you are telling us that 164 the data *is* alpha, just don't premultiply it. That's important when 165 using SRGB pixel formats, we need to know where the alpha is, because 166 it is converted linearly (rather than with the SRGB converters). 167 168 Because alpha weighting produces the same effect as premultiplying, you 169 even have the option with non-premultiplied inputs to let the resizer 170 produce a premultiplied output. Because the intially computed alpha-weighted 171 output image is effectively premultiplied, this is actually more performant 172 than the normal path which un-premultiplies the output image as a final step. 173 174 Finally, when converting both in and out of non-premulitplied space (for 175 example, when using STBIR_RGBA), we go to somewhat heroic measures to 176 ensure that areas with zero alpha value pixels get something reasonable 177 in the RGB values. If you don't care about the RGB values of zero alpha 178 pixels, you can call the stbir_set_non_pm_alpha_speed_over_quality() 179 function - this runs a premultiplied resize about 25% faster. That said, 180 when you really care about speed, using premultiplied pixels for both in 181 and out (STBIR_RGBA_PM, etc) much faster than both of these premultiplied 182 options. 183 184 PIXEL LAYOUT CONVERSION 185 The resizer can convert from some pixel layouts to others. When using the 186 stbir_set_pixel_layouts(), you can, for example, specify STBIR_RGBA 187 on input, and STBIR_ARGB on output, and it will re-organize the channels 188 during the resize. Currently, you can only convert between two pixel 189 layouts with the same number of channels. 190 191 DETERMINISM 192 We commit to being deterministic (from x64 to ARM to scalar to SIMD, etc). 193 This requires compiling with fast-math off (using at least /fp:precise). 194 Also, you must turn off fp-contracting (which turns mult+adds into fmas)! 195 We attempt to do this with pragmas, but with Clang, you usually want to add 196 -ffp-contract=off to the command line as well. 197 198 For 32-bit x86, you must use SSE and SSE2 codegen for determinism. That is, 199 if the scalar x87 unit gets used at all, we immediately lose determinism. 200 On Microsoft Visual Studio 2008 and earlier, from what we can tell there is 201 no way to be deterministic in 32-bit x86 (some x87 always leaks in, even 202 with fp:strict). On 32-bit x86 GCC, determinism requires both -msse2 and 203 -fpmath=sse. 204 205 Note that we will not be deterministic with float data containing NaNs - 206 the NaNs will propagate differently on different SIMD and platforms. 207 208 If you turn on STBIR_USE_FMA, then we will be deterministic with other 209 fma targets, but we will differ from non-fma targets (this is unavoidable, 210 because a fma isn't simply an add with a mult - it also introduces a 211 rounding difference compared to non-fma instruction sequences. 212 213 FLOAT PIXEL FORMAT RANGE 214 Any range of values can be used for the non-alpha float data that you pass 215 in (0 to 1, -1 to 1, whatever). However, if you are inputting float values 216 but *outputting* bytes or shorts, you must use a range of 0 to 1 so that we 217 scale back properly. The alpha channel must also be 0 to 1 for any format 218 that does premultiplication prior to resizing. 219 220 Note also that with float output, using filters with negative lobes, the 221 output filtered values might go slightly out of range. You can define 222 STBIR_FLOAT_LOW_CLAMP and/or STBIR_FLOAT_HIGH_CLAMP to specify the range 223 to clamp to on output, if that's important. 224 225 MAX/MIN SCALE FACTORS 226 The input pixel resolutions are in integers, and we do the internal pointer 227 resolution in size_t sized integers. However, the scale ratio from input 228 resolution to output resolution is calculated in float form. This means 229 the effective possible scale ratio is limited to 24 bits (or 16 million 230 to 1). As you get close to the size of the float resolution (again, 16 231 million pixels wide or high), you might start seeing float inaccuracy 232 issues in general in the pipeline. If you have to do extreme resizes, 233 you can usually do this is multiple stages (using float intermediate 234 buffers). 235 236 FLIPPED IMAGES 237 Stride is just the delta from one scanline to the next. This means you can 238 use a negative stride to handle inverted images (point to the final 239 scanline and use a negative stride). You can invert the input or output, 240 using negative strides. 241 242 DEFAULT FILTERS 243 For functions which don't provide explicit control over what filters to 244 use, you can change the compile-time defaults with: 245 246 #define STBIR_DEFAULT_FILTER_UPSAMPLE STBIR_FILTER_something 247 #define STBIR_DEFAULT_FILTER_DOWNSAMPLE STBIR_FILTER_something 248 249 See stbir_filter in the header-file section for the list of filters. 250 251 NEW FILTERS 252 A number of 1D filter kernels are supplied. For a list of supported 253 filters, see the stbir_filter enum. You can install your own filters by 254 using the stbir_set_filter_callbacks function. 255 256 PROGRESS 257 For interactive use with slow resize operations, you can use the the 258 scanline callbacks in the extended API. It would have to be a *very* large 259 image resample to need progress though - we're very fast. 260 261 CEIL and FLOOR 262 In scalar mode, the only functions we use from math.h are ceilf and floorf, 263 but if you have your own versions, you can define the STBIR_CEILF(v) and 264 STBIR_FLOORF(v) macros and we'll use them instead. In SIMD, we just use 265 our own versions. 266 267 ASSERT 268 Define STBIR_ASSERT(boolval) to override assert() and not use assert.h 269 270 PORTING FROM VERSION 1 271 The API has changed. You can continue to use the old version of stb_image_resize.h, 272 which is available in the "deprecated/" directory. 273 274 If you're using the old simple-to-use API, porting is straightforward. 275 (For more advanced APIs, read the documentation.) 276 277 stbir_resize_uint8(): 278 - call `stbir_resize_uint8_linear`, cast channel count to `stbir_pixel_layout` 279 280 stbir_resize_float(): 281 - call `stbir_resize_float_linear`, cast channel count to `stbir_pixel_layout` 282 283 stbir_resize_uint8_srgb(): 284 - function name is unchanged 285 - cast channel count to `stbir_pixel_layout` 286 - above is sufficient unless your image has alpha and it's not RGBA/BGRA 287 - in that case, follow the below instructions for stbir_resize_uint8_srgb_edgemode 288 289 stbir_resize_uint8_srgb_edgemode() 290 - switch to the "medium complexity" API 291 - stbir_resize(), very similar API but a few more parameters: 292 - pixel_layout: cast channel count to `stbir_pixel_layout` 293 - data_type: STBIR_TYPE_UINT8_SRGB 294 - edge: unchanged (STBIR_EDGE_WRAP, etc.) 295 - filter: STBIR_FILTER_DEFAULT 296 - which channel is alpha is specified in stbir_pixel_layout, see enum for details 297 298 FUTURE TODOS 299 * For polyphase integral filters, we just memcpy the coeffs to dupe 300 them, but we should indirect and use the same coeff memory. 301 * Add pixel layout conversions for sensible different channel counts 302 (maybe, 1->3/4, 3->4, 4->1, 3->1). 303 * For SIMD encode and decode scanline routines, do any pre-aligning 304 for bad input/output buffer alignments and pitch? 305 * For very wide scanlines, we should we do vertical strips to stay within 306 L2 cache. Maybe do chunks of 1K pixels at a time. There would be 307 some pixel reconversion, but probably dwarfed by things falling out 308 of cache. Probably also something possible with alternating between 309 scattering and gathering at high resize scales? 310 * Rewrite the coefficient generator to do many at once. 311 * AVX-512 vertical kernels - worried about downclocking here. 312 * Convert the reincludes to macros when we know they aren't changing. 313 * Experiment with pivoting the horizontal and always using the 314 vertical filters (which are faster, but perhaps not enough to overcome 315 the pivot cost and the extra memory touches). Need to buffer the whole 316 image so have to balance memory use. 317 * Most of our code is internally function pointers, should we compile 318 all the SIMD stuff always and dynamically dispatch? 319 320 CONTRIBUTORS 321 Jeff Roberts: 2.0 implementation, optimizations, SIMD 322 Martins Mozeiko: NEON simd, WASM simd, clang and GCC whisperer 323 Fabian Giesen: half float and srgb converters 324 Sean Barrett: API design, optimizations 325 Jorge L Rodriguez: Original 1.0 implementation 326 Aras Pranckevicius: bugfixes 327 Nathan Reed: warning fixes for 1.0 328 329 REVISIONS 330 2.12 (2024-10-18) fix incorrect use of user_data with STBIR_FREE 331 2.11 (2024-09-08) fix harmless asan warnings in 2-channel and 3-channel mode 332 with AVX-2, fix some weird scaling edge conditions with 333 point sample mode. 334 2.10 (2024-07-27) fix the defines GCC and mingw for loop unroll control, 335 fix MSVC 32-bit arm half float routines. 336 2.09 (2024-06-19) fix the defines for 32-bit ARM GCC builds (was selecting 337 hardware half floats). 338 2.08 (2024-06-10) fix for RGB->BGR three channel flips and add SIMD (thanks 339 to Ryan Salsbury), fix for sub-rect resizes, use the 340 pragmas to control unrolling when they are available. 341 2.07 (2024-05-24) fix for slow final split during threaded conversions of very 342 wide scanlines when downsampling (caused by extra input 343 converting), fix for wide scanline resamples with many 344 splits (int overflow), fix GCC warning. 345 2.06 (2024-02-10) fix for identical width/height 3x or more down-scaling 346 undersampling a single row on rare resize ratios (about 1%). 347 2.05 (2024-02-07) fix for 2 pixel to 1 pixel resizes with wrap (thanks Aras), 348 fix for output callback (thanks Julien Koenen). 349 2.04 (2023-11-17) fix for rare AVX bug, shadowed symbol (thanks Nikola Smiljanic). 350 2.03 (2023-11-01) ASAN and TSAN warnings fixed, minor tweaks. 351 2.00 (2023-10-10) mostly new source: new api, optimizations, simd, vertical-first, etc 352 2x-5x faster without simd, 4x-12x faster with simd, 353 in some cases, 20x to 40x faster esp resizing large to very small. 354 0.96 (2019-03-04) fixed warnings 355 0.95 (2017-07-23) fixed warnings 356 0.94 (2017-03-18) fixed warnings 357 0.93 (2017-03-03) fixed bug with certain combinations of heights 358 0.92 (2017-01-02) fix integer overflow on large (>2GB) images 359 0.91 (2016-04-02) fix warnings; fix handling of subpixel regions 360 0.90 (2014-09-17) first released version 361 362 LICENSE 363 See end of file for license information. 364 */ 365 366 #if !defined(STB_IMAGE_RESIZE_DO_HORIZONTALS) && !defined(STB_IMAGE_RESIZE_DO_VERTICALS) && !defined(STB_IMAGE_RESIZE_DO_CODERS) // for internal re-includes 367 368 #ifndef STBIR_INCLUDE_STB_IMAGE_RESIZE2_H 369 #define STBIR_INCLUDE_STB_IMAGE_RESIZE2_H 370 371 #include <stddef.h> 372 #ifdef _MSC_VER 373 typedef unsigned char stbir_uint8; 374 typedef unsigned short stbir_uint16; 375 typedef unsigned int stbir_uint32; 376 typedef unsigned __int64 stbir_uint64; 377 #else 378 #include <stdint.h> 379 typedef uint8_t stbir_uint8; 380 typedef uint16_t stbir_uint16; 381 typedef uint32_t stbir_uint32; 382 typedef uint64_t stbir_uint64; 383 #endif 384 385 #ifdef _M_IX86_FP 386 #if ( _M_IX86_FP >= 1 ) 387 #ifndef STBIR_SSE 388 #define STBIR_SSE 389 #endif 390 #endif 391 #endif 392 393 #if defined(_x86_64) || defined( __x86_64__ ) || defined( _M_X64 ) || defined(__x86_64) || defined(_M_AMD64) || defined(__SSE2__) || defined(STBIR_SSE) || defined(STBIR_SSE2) 394 #ifndef STBIR_SSE2 395 #define STBIR_SSE2 396 #endif 397 #if defined(__AVX__) || defined(STBIR_AVX2) 398 #ifndef STBIR_AVX 399 #ifndef STBIR_NO_AVX 400 #define STBIR_AVX 401 #endif 402 #endif 403 #endif 404 #if defined(__AVX2__) || defined(STBIR_AVX2) 405 #ifndef STBIR_NO_AVX2 406 #ifndef STBIR_AVX2 407 #define STBIR_AVX2 408 #endif 409 #if defined( _MSC_VER ) && !defined(__clang__) 410 #ifndef STBIR_FP16C // FP16C instructions are on all AVX2 cpus, so we can autoselect it here on microsoft - clang needs -m16c 411 #define STBIR_FP16C 412 #endif 413 #endif 414 #endif 415 #endif 416 #ifdef __F16C__ 417 #ifndef STBIR_FP16C // turn on FP16C instructions if the define is set (for clang and gcc) 418 #define STBIR_FP16C 419 #endif 420 #endif 421 #endif 422 423 #if defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ ) || ((__ARM_NEON_FP & 4) != 0) || defined(__ARM_NEON__) 424 #ifndef STBIR_NEON 425 #define STBIR_NEON 426 #endif 427 #endif 428 429 #if defined(_M_ARM) || defined(__arm__) 430 #ifdef STBIR_USE_FMA 431 #undef STBIR_USE_FMA // no FMA for 32-bit arm on MSVC 432 #endif 433 #endif 434 435 #if defined(__wasm__) && defined(__wasm_simd128__) 436 #ifndef STBIR_WASM 437 #define STBIR_WASM 438 #endif 439 #endif 440 441 #ifndef STBIRDEF 442 #ifdef STB_IMAGE_RESIZE_STATIC 443 #define STBIRDEF static 444 #else 445 #ifdef __cplusplus 446 #define STBIRDEF extern "C" 447 #else 448 #define STBIRDEF extern 449 #endif 450 #endif 451 #endif 452 453 ////////////////////////////////////////////////////////////////////////////// 454 //// start "header file" /////////////////////////////////////////////////// 455 // 456 // Easy-to-use API: 457 // 458 // * stride is the offset between successive rows of image data 459 // in memory, in bytes. specify 0 for packed continuously in memory 460 // * colorspace is linear or sRGB as specified by function name 461 // * Uses the default filters 462 // * Uses edge mode clamped 463 // * returned result is 1 for success or 0 in case of an error. 464 465 466 // stbir_pixel_layout specifies: 467 // number of channels 468 // order of channels 469 // whether color is premultiplied by alpha 470 // for back compatibility, you can cast the old channel count to an stbir_pixel_layout 471 typedef enum 472 { 473 STBIR_1CHANNEL = 1, 474 STBIR_2CHANNEL = 2, 475 STBIR_RGB = 3, // 3-chan, with order specified (for channel flipping) 476 STBIR_BGR = 0, // 3-chan, with order specified (for channel flipping) 477 STBIR_4CHANNEL = 5, 478 479 STBIR_RGBA = 4, // alpha formats, where alpha is NOT premultiplied into color channels 480 STBIR_BGRA = 6, 481 STBIR_ARGB = 7, 482 STBIR_ABGR = 8, 483 STBIR_RA = 9, 484 STBIR_AR = 10, 485 486 STBIR_RGBA_PM = 11, // alpha formats, where alpha is premultiplied into color channels 487 STBIR_BGRA_PM = 12, 488 STBIR_ARGB_PM = 13, 489 STBIR_ABGR_PM = 14, 490 STBIR_RA_PM = 15, 491 STBIR_AR_PM = 16, 492 493 STBIR_RGBA_NO_AW = 11, // alpha formats, where NO alpha weighting is applied at all! 494 STBIR_BGRA_NO_AW = 12, // these are just synonyms for the _PM flags (which also do 495 STBIR_ARGB_NO_AW = 13, // no alpha weighting). These names just make it more clear 496 STBIR_ABGR_NO_AW = 14, // for some folks). 497 STBIR_RA_NO_AW = 15, 498 STBIR_AR_NO_AW = 16, 499 500 } stbir_pixel_layout; 501 502 //=============================================================== 503 // Simple-complexity API 504 // 505 // If output_pixels is NULL (0), then we will allocate the buffer and return it to you. 506 //-------------------------------- 507 508 STBIRDEF unsigned char * stbir_resize_uint8_srgb( const unsigned char *input_pixels , int input_w , int input_h, int input_stride_in_bytes, 509 unsigned char *output_pixels, int output_w, int output_h, int output_stride_in_bytes, 510 stbir_pixel_layout pixel_type ); 511 512 STBIRDEF unsigned char * stbir_resize_uint8_linear( const unsigned char *input_pixels , int input_w , int input_h, int input_stride_in_bytes, 513 unsigned char *output_pixels, int output_w, int output_h, int output_stride_in_bytes, 514 stbir_pixel_layout pixel_type ); 515 516 STBIRDEF float * stbir_resize_float_linear( const float *input_pixels , int input_w , int input_h, int input_stride_in_bytes, 517 float *output_pixels, int output_w, int output_h, int output_stride_in_bytes, 518 stbir_pixel_layout pixel_type ); 519 //=============================================================== 520 521 //=============================================================== 522 // Medium-complexity API 523 // 524 // This extends the easy-to-use API as follows: 525 // 526 // * Can specify the datatype - U8, U8_SRGB, U16, FLOAT, HALF_FLOAT 527 // * Edge wrap can selected explicitly 528 // * Filter can be selected explicitly 529 //-------------------------------- 530 531 typedef enum 532 { 533 STBIR_EDGE_CLAMP = 0, 534 STBIR_EDGE_REFLECT = 1, 535 STBIR_EDGE_WRAP = 2, // this edge mode is slower and uses more memory 536 STBIR_EDGE_ZERO = 3, 537 } stbir_edge; 538 539 typedef enum 540 { 541 STBIR_FILTER_DEFAULT = 0, // use same filter type that easy-to-use API chooses 542 STBIR_FILTER_BOX = 1, // A trapezoid w/1-pixel wide ramps, same result as box for integer scale ratios 543 STBIR_FILTER_TRIANGLE = 2, // On upsampling, produces same results as bilinear texture filtering 544 STBIR_FILTER_CUBICBSPLINE = 3, // The cubic b-spline (aka Mitchell-Netrevalli with B=1,C=0), gaussian-esque 545 STBIR_FILTER_CATMULLROM = 4, // An interpolating cubic spline 546 STBIR_FILTER_MITCHELL = 5, // Mitchell-Netrevalli filter with B=1/3, C=1/3 547 STBIR_FILTER_POINT_SAMPLE = 6, // Simple point sampling 548 STBIR_FILTER_OTHER = 7, // User callback specified 549 } stbir_filter; 550 551 typedef enum 552 { 553 STBIR_TYPE_UINT8 = 0, 554 STBIR_TYPE_UINT8_SRGB = 1, 555 STBIR_TYPE_UINT8_SRGB_ALPHA = 2, // alpha channel, when present, should also be SRGB (this is very unusual) 556 STBIR_TYPE_UINT16 = 3, 557 STBIR_TYPE_FLOAT = 4, 558 STBIR_TYPE_HALF_FLOAT = 5 559 } stbir_datatype; 560 561 // medium api 562 STBIRDEF void * stbir_resize( const void *input_pixels , int input_w , int input_h, int input_stride_in_bytes, 563 void *output_pixels, int output_w, int output_h, int output_stride_in_bytes, 564 stbir_pixel_layout pixel_layout, stbir_datatype data_type, 565 stbir_edge edge, stbir_filter filter ); 566 //=============================================================== 567 568 569 570 //=============================================================== 571 // Extended-complexity API 572 // 573 // This API exposes all resize functionality. 574 // 575 // * Separate filter types for each axis 576 // * Separate edge modes for each axis 577 // * Separate input and output data types 578 // * Can specify regions with subpixel correctness 579 // * Can specify alpha flags 580 // * Can specify a memory callback 581 // * Can specify a callback data type for pixel input and output 582 // * Can be threaded for a single resize 583 // * Can be used to resize many frames without recalculating the sampler info 584 // 585 // Use this API as follows: 586 // 1) Call the stbir_resize_init function on a local STBIR_RESIZE structure 587 // 2) Call any of the stbir_set functions 588 // 3) Optionally call stbir_build_samplers() if you are going to resample multiple times 589 // with the same input and output dimensions (like resizing video frames) 590 // 4) Resample by calling stbir_resize_extended(). 591 // 5) Call stbir_free_samplers() if you called stbir_build_samplers() 592 //-------------------------------- 593 594 595 // Types: 596 597 // INPUT CALLBACK: this callback is used for input scanlines 598 typedef void const * stbir_input_callback( void * optional_output, void const * input_ptr, int num_pixels, int x, int y, void * context ); 599 600 // OUTPUT CALLBACK: this callback is used for output scanlines 601 typedef void stbir_output_callback( void const * output_ptr, int num_pixels, int y, void * context ); 602 603 // callbacks for user installed filters 604 typedef float stbir__kernel_callback( float x, float scale, void * user_data ); // centered at zero 605 typedef float stbir__support_callback( float scale, void * user_data ); 606 607 // internal structure with precomputed scaling 608 typedef struct stbir__info stbir__info; 609 610 typedef struct STBIR_RESIZE // use the stbir_resize_init and stbir_override functions to set these values for future compatibility 611 { 612 void * user_data; 613 void const * input_pixels; 614 int input_w, input_h; 615 double input_s0, input_t0, input_s1, input_t1; 616 stbir_input_callback * input_cb; 617 void * output_pixels; 618 int output_w, output_h; 619 int output_subx, output_suby, output_subw, output_subh; 620 stbir_output_callback * output_cb; 621 int input_stride_in_bytes; 622 int output_stride_in_bytes; 623 int splits; 624 int fast_alpha; 625 int needs_rebuild; 626 int called_alloc; 627 stbir_pixel_layout input_pixel_layout_public; 628 stbir_pixel_layout output_pixel_layout_public; 629 stbir_datatype input_data_type; 630 stbir_datatype output_data_type; 631 stbir_filter horizontal_filter, vertical_filter; 632 stbir_edge horizontal_edge, vertical_edge; 633 stbir__kernel_callback * horizontal_filter_kernel; stbir__support_callback * horizontal_filter_support; 634 stbir__kernel_callback * vertical_filter_kernel; stbir__support_callback * vertical_filter_support; 635 stbir__info * samplers; 636 } STBIR_RESIZE; 637 638 // extended complexity api 639 640 641 // First off, you must ALWAYS call stbir_resize_init on your resize structure before any of the other calls! 642 STBIRDEF void stbir_resize_init( STBIR_RESIZE * resize, 643 const void *input_pixels, int input_w, int input_h, int input_stride_in_bytes, // stride can be zero 644 void *output_pixels, int output_w, int output_h, int output_stride_in_bytes, // stride can be zero 645 stbir_pixel_layout pixel_layout, stbir_datatype data_type ); 646 647 //=============================================================== 648 // You can update these parameters any time after resize_init and there is no cost 649 //-------------------------------- 650 651 STBIRDEF void stbir_set_datatypes( STBIR_RESIZE * resize, stbir_datatype input_type, stbir_datatype output_type ); 652 STBIRDEF void stbir_set_pixel_callbacks( STBIR_RESIZE * resize, stbir_input_callback * input_cb, stbir_output_callback * output_cb ); // no callbacks by default 653 STBIRDEF void stbir_set_user_data( STBIR_RESIZE * resize, void * user_data ); // pass back STBIR_RESIZE* by default 654 STBIRDEF void stbir_set_buffer_ptrs( STBIR_RESIZE * resize, const void * input_pixels, int input_stride_in_bytes, void * output_pixels, int output_stride_in_bytes ); 655 656 //=============================================================== 657 658 659 //=============================================================== 660 // If you call any of these functions, you will trigger a sampler rebuild! 661 //-------------------------------- 662 663 STBIRDEF int stbir_set_pixel_layouts( STBIR_RESIZE * resize, stbir_pixel_layout input_pixel_layout, stbir_pixel_layout output_pixel_layout ); // sets new buffer layouts 664 STBIRDEF int stbir_set_edgemodes( STBIR_RESIZE * resize, stbir_edge horizontal_edge, stbir_edge vertical_edge ); // CLAMP by default 665 666 STBIRDEF int stbir_set_filters( STBIR_RESIZE * resize, stbir_filter horizontal_filter, stbir_filter vertical_filter ); // STBIR_DEFAULT_FILTER_UPSAMPLE/DOWNSAMPLE by default 667 STBIRDEF int stbir_set_filter_callbacks( STBIR_RESIZE * resize, stbir__kernel_callback * horizontal_filter, stbir__support_callback * horizontal_support, stbir__kernel_callback * vertical_filter, stbir__support_callback * vertical_support ); 668 669 STBIRDEF int stbir_set_pixel_subrect( STBIR_RESIZE * resize, int subx, int suby, int subw, int subh ); // sets both sub-regions (full regions by default) 670 STBIRDEF int stbir_set_input_subrect( STBIR_RESIZE * resize, double s0, double t0, double s1, double t1 ); // sets input sub-region (full region by default) 671 STBIRDEF int stbir_set_output_pixel_subrect( STBIR_RESIZE * resize, int subx, int suby, int subw, int subh ); // sets output sub-region (full region by default) 672 673 // when inputting AND outputting non-premultiplied alpha pixels, we use a slower but higher quality technique 674 // that fills the zero alpha pixel's RGB values with something plausible. If you don't care about areas of 675 // zero alpha, you can call this function to get about a 25% speed improvement for STBIR_RGBA to STBIR_RGBA 676 // types of resizes. 677 STBIRDEF int stbir_set_non_pm_alpha_speed_over_quality( STBIR_RESIZE * resize, int non_pma_alpha_speed_over_quality ); 678 //=============================================================== 679 680 681 //=============================================================== 682 // You can call build_samplers to prebuild all the internal data we need to resample. 683 // Then, if you call resize_extended many times with the same resize, you only pay the 684 // cost once. 685 // If you do call build_samplers, you MUST call free_samplers eventually. 686 //-------------------------------- 687 688 // This builds the samplers and does one allocation 689 STBIRDEF int stbir_build_samplers( STBIR_RESIZE * resize ); 690 691 // You MUST call this, if you call stbir_build_samplers or stbir_build_samplers_with_splits 692 STBIRDEF void stbir_free_samplers( STBIR_RESIZE * resize ); 693 //=============================================================== 694 695 696 // And this is the main function to perform the resize synchronously on one thread. 697 STBIRDEF int stbir_resize_extended( STBIR_RESIZE * resize ); 698 699 700 //=============================================================== 701 // Use these functions for multithreading. 702 // 1) You call stbir_build_samplers_with_splits first on the main thread 703 // 2) Then stbir_resize_with_split on each thread 704 // 3) stbir_free_samplers when done on the main thread 705 //-------------------------------- 706 707 // This will build samplers for threading. 708 // You can pass in the number of threads you'd like to use (try_splits). 709 // It returns the number of splits (threads) that you can call it with. 710 /// It might be less if the image resize can't be split up that many ways. 711 712 STBIRDEF int stbir_build_samplers_with_splits( STBIR_RESIZE * resize, int try_splits ); 713 714 // This function does a split of the resizing (you call this fuction for each 715 // split, on multiple threads). A split is a piece of the output resize pixel space. 716 717 // Note that you MUST call stbir_build_samplers_with_splits before stbir_resize_extended_split! 718 719 // Usually, you will always call stbir_resize_split with split_start as the thread_index 720 // and "1" for the split_count. 721 // But, if you have a weird situation where you MIGHT want 8 threads, but sometimes 722 // only 4 threads, you can use 0,2,4,6 for the split_start's and use "2" for the 723 // split_count each time to turn in into a 4 thread resize. (This is unusual). 724 725 STBIRDEF int stbir_resize_extended_split( STBIR_RESIZE * resize, int split_start, int split_count ); 726 //=============================================================== 727 728 729 //=============================================================== 730 // Pixel Callbacks info: 731 //-------------------------------- 732 733 // The input callback is super flexible - it calls you with the input address 734 // (based on the stride and base pointer), it gives you an optional_output 735 // pointer that you can fill, or you can just return your own pointer into 736 // your own data. 737 // 738 // You can also do conversion from non-supported data types if necessary - in 739 // this case, you ignore the input_ptr and just use the x and y parameters to 740 // calculate your own input_ptr based on the size of each non-supported pixel. 741 // (Something like the third example below.) 742 // 743 // You can also install just an input or just an output callback by setting the 744 // callback that you don't want to zero. 745 // 746 // First example, progress: (getting a callback that you can monitor the progress): 747 // void const * my_callback( void * optional_output, void const * input_ptr, int num_pixels, int x, int y, void * context ) 748 // { 749 // percentage_done = y / input_height; 750 // return input_ptr; // use buffer from call 751 // } 752 // 753 // Next example, copying: (copy from some other buffer or stream): 754 // void const * my_callback( void * optional_output, void const * input_ptr, int num_pixels, int x, int y, void * context ) 755 // { 756 // CopyOrStreamData( optional_output, other_data_src, num_pixels * pixel_width_in_bytes ); 757 // return optional_output; // return the optional buffer that we filled 758 // } 759 // 760 // Third example, input another buffer without copying: (zero-copy from other buffer): 761 // void const * my_callback( void * optional_output, void const * input_ptr, int num_pixels, int x, int y, void * context ) 762 // { 763 // void * pixels = ( (char*) other_image_base ) + ( y * other_image_stride ) + ( x * other_pixel_width_in_bytes ); 764 // return pixels; // return pointer to your data without copying 765 // } 766 // 767 // 768 // The output callback is considerably simpler - it just calls you so that you can dump 769 // out each scanline. You could even directly copy out to disk if you have a simple format 770 // like TGA or BMP. You can also convert to other output types here if you want. 771 // 772 // Simple example: 773 // void const * my_output( void * output_ptr, int num_pixels, int y, void * context ) 774 // { 775 // percentage_done = y / output_height; 776 // fwrite( output_ptr, pixel_width_in_bytes, num_pixels, output_file ); 777 // } 778 //=============================================================== 779 780 781 782 783 //=============================================================== 784 // optional built-in profiling API 785 //-------------------------------- 786 787 #ifdef STBIR_PROFILE 788 789 typedef struct STBIR_PROFILE_INFO 790 { 791 stbir_uint64 total_clocks; 792 793 // how many clocks spent (of total_clocks) in the various resize routines, along with a string description 794 // there are "resize_count" number of zones 795 stbir_uint64 clocks[ 8 ]; 796 char const ** descriptions; 797 798 // count of clocks and descriptions 799 stbir_uint32 count; 800 } STBIR_PROFILE_INFO; 801 802 // use after calling stbir_resize_extended (or stbir_build_samplers or stbir_build_samplers_with_splits) 803 STBIRDEF void stbir_resize_build_profile_info( STBIR_PROFILE_INFO * out_info, STBIR_RESIZE const * resize ); 804 805 // use after calling stbir_resize_extended 806 STBIRDEF void stbir_resize_extended_profile_info( STBIR_PROFILE_INFO * out_info, STBIR_RESIZE const * resize ); 807 808 // use after calling stbir_resize_extended_split 809 STBIRDEF void stbir_resize_split_profile_info( STBIR_PROFILE_INFO * out_info, STBIR_RESIZE const * resize, int split_start, int split_num ); 810 811 //=============================================================== 812 813 #endif 814 815 816 //// end header file ///////////////////////////////////////////////////// 817 #endif // STBIR_INCLUDE_STB_IMAGE_RESIZE2_H 818 819 #if defined(STB_IMAGE_RESIZE_IMPLEMENTATION) || defined(STB_IMAGE_RESIZE2_IMPLEMENTATION) 820 821 #ifndef STBIR_ASSERT 822 #include <assert.h> 823 #define STBIR_ASSERT(x) assert(x) 824 #endif 825 826 #ifndef STBIR_MALLOC 827 #include <stdlib.h> 828 #define STBIR_MALLOC(size,user_data) ((void)(user_data), malloc(size)) 829 #define STBIR_FREE(ptr,user_data) ((void)(user_data), free(ptr)) 830 // (we used the comma operator to evaluate user_data, to avoid "unused parameter" warnings) 831 #endif 832 833 #ifdef _MSC_VER 834 835 #define stbir__inline __forceinline 836 837 #else 838 839 #define stbir__inline __inline__ 840 841 // Clang address sanitizer 842 #if defined(__has_feature) 843 #if __has_feature(address_sanitizer) || __has_feature(memory_sanitizer) 844 #ifndef STBIR__SEPARATE_ALLOCATIONS 845 #define STBIR__SEPARATE_ALLOCATIONS 846 #endif 847 #endif 848 #endif 849 850 #endif 851 852 // GCC and MSVC 853 #if defined(__SANITIZE_ADDRESS__) 854 #ifndef STBIR__SEPARATE_ALLOCATIONS 855 #define STBIR__SEPARATE_ALLOCATIONS 856 #endif 857 #endif 858 859 // Always turn off automatic FMA use - use STBIR_USE_FMA if you want. 860 // Otherwise, this is a determinism disaster. 861 #ifndef STBIR_DONT_CHANGE_FP_CONTRACT // override in case you don't want this behavior 862 #if defined(_MSC_VER) && !defined(__clang__) 863 #if _MSC_VER > 1200 864 #pragma fp_contract(off) 865 #endif 866 #elif defined(__GNUC__) && !defined(__clang__) 867 #pragma GCC optimize("fp-contract=off") 868 #else 869 #pragma STDC FP_CONTRACT OFF 870 #endif 871 #endif 872 873 #ifdef _MSC_VER 874 #define STBIR__UNUSED(v) (void)(v) 875 #else 876 #define STBIR__UNUSED(v) (void)sizeof(v) 877 #endif 878 879 #define STBIR__ARRAY_SIZE(a) (sizeof((a))/sizeof((a)[0])) 880 881 882 #ifndef STBIR_DEFAULT_FILTER_UPSAMPLE 883 #define STBIR_DEFAULT_FILTER_UPSAMPLE STBIR_FILTER_CATMULLROM 884 #endif 885 886 #ifndef STBIR_DEFAULT_FILTER_DOWNSAMPLE 887 #define STBIR_DEFAULT_FILTER_DOWNSAMPLE STBIR_FILTER_MITCHELL 888 #endif 889 890 891 #ifndef STBIR__HEADER_FILENAME 892 #define STBIR__HEADER_FILENAME "stb_image_resize2.h" 893 #endif 894 895 // the internal pixel layout enums are in a different order, so we can easily do range comparisons of types 896 // the public pixel layout is ordered in a way that if you cast num_channels (1-4) to the enum, you get something sensible 897 typedef enum 898 { 899 STBIRI_1CHANNEL = 0, 900 STBIRI_2CHANNEL = 1, 901 STBIRI_RGB = 2, 902 STBIRI_BGR = 3, 903 STBIRI_4CHANNEL = 4, 904 905 STBIRI_RGBA = 5, 906 STBIRI_BGRA = 6, 907 STBIRI_ARGB = 7, 908 STBIRI_ABGR = 8, 909 STBIRI_RA = 9, 910 STBIRI_AR = 10, 911 912 STBIRI_RGBA_PM = 11, 913 STBIRI_BGRA_PM = 12, 914 STBIRI_ARGB_PM = 13, 915 STBIRI_ABGR_PM = 14, 916 STBIRI_RA_PM = 15, 917 STBIRI_AR_PM = 16, 918 } stbir_internal_pixel_layout; 919 920 // define the public pixel layouts to not compile inside the implementation (to avoid accidental use) 921 #define STBIR_BGR bad_dont_use_in_implementation 922 #define STBIR_1CHANNEL STBIR_BGR 923 #define STBIR_2CHANNEL STBIR_BGR 924 #define STBIR_RGB STBIR_BGR 925 #define STBIR_RGBA STBIR_BGR 926 #define STBIR_4CHANNEL STBIR_BGR 927 #define STBIR_BGRA STBIR_BGR 928 #define STBIR_ARGB STBIR_BGR 929 #define STBIR_ABGR STBIR_BGR 930 #define STBIR_RA STBIR_BGR 931 #define STBIR_AR STBIR_BGR 932 #define STBIR_RGBA_PM STBIR_BGR 933 #define STBIR_BGRA_PM STBIR_BGR 934 #define STBIR_ARGB_PM STBIR_BGR 935 #define STBIR_ABGR_PM STBIR_BGR 936 #define STBIR_RA_PM STBIR_BGR 937 #define STBIR_AR_PM STBIR_BGR 938 939 // must match stbir_datatype 940 static unsigned char stbir__type_size[] = { 941 1,1,1,2,4,2 // STBIR_TYPE_UINT8,STBIR_TYPE_UINT8_SRGB,STBIR_TYPE_UINT8_SRGB_ALPHA,STBIR_TYPE_UINT16,STBIR_TYPE_FLOAT,STBIR_TYPE_HALF_FLOAT 942 }; 943 944 // When gathering, the contributors are which source pixels contribute. 945 // When scattering, the contributors are which destination pixels are contributed to. 946 typedef struct 947 { 948 int n0; // First contributing pixel 949 int n1; // Last contributing pixel 950 } stbir__contributors; 951 952 typedef struct 953 { 954 int lowest; // First sample index for whole filter 955 int highest; // Last sample index for whole filter 956 int widest; // widest single set of samples for an output 957 } stbir__filter_extent_info; 958 959 typedef struct 960 { 961 int n0; // First pixel of decode buffer to write to 962 int n1; // Last pixel of decode that will be written to 963 int pixel_offset_for_input; // Pixel offset into input_scanline 964 } stbir__span; 965 966 typedef struct stbir__scale_info 967 { 968 int input_full_size; 969 int output_sub_size; 970 float scale; 971 float inv_scale; 972 float pixel_shift; // starting shift in output pixel space (in pixels) 973 int scale_is_rational; 974 stbir_uint32 scale_numerator, scale_denominator; 975 } stbir__scale_info; 976 977 typedef struct 978 { 979 stbir__contributors * contributors; 980 float* coefficients; 981 stbir__contributors * gather_prescatter_contributors; 982 float * gather_prescatter_coefficients; 983 stbir__scale_info scale_info; 984 float support; 985 stbir_filter filter_enum; 986 stbir__kernel_callback * filter_kernel; 987 stbir__support_callback * filter_support; 988 stbir_edge edge; 989 int coefficient_width; 990 int filter_pixel_width; 991 int filter_pixel_margin; 992 int num_contributors; 993 int contributors_size; 994 int coefficients_size; 995 stbir__filter_extent_info extent_info; 996 int is_gather; // 0 = scatter, 1 = gather with scale >= 1, 2 = gather with scale < 1 997 int gather_prescatter_num_contributors; 998 int gather_prescatter_coefficient_width; 999 int gather_prescatter_contributors_size; 1000 int gather_prescatter_coefficients_size; 1001 } stbir__sampler; 1002 1003 typedef struct 1004 { 1005 stbir__contributors conservative; 1006 int edge_sizes[2]; // this can be less than filter_pixel_margin, if the filter and scaling falls off 1007 stbir__span spans[2]; // can be two spans, if doing input subrect with clamp mode WRAP 1008 } stbir__extents; 1009 1010 typedef struct 1011 { 1012 #ifdef STBIR_PROFILE 1013 union 1014 { 1015 struct { stbir_uint64 total, looping, vertical, horizontal, decode, encode, alpha, unalpha; } named; 1016 stbir_uint64 array[8]; 1017 } profile; 1018 stbir_uint64 * current_zone_excluded_ptr; 1019 #endif 1020 float* decode_buffer; 1021 1022 int ring_buffer_first_scanline; 1023 int ring_buffer_last_scanline; 1024 int ring_buffer_begin_index; // first_scanline is at this index in the ring buffer 1025 int start_output_y, end_output_y; 1026 int start_input_y, end_input_y; // used in scatter only 1027 1028 #ifdef STBIR__SEPARATE_ALLOCATIONS 1029 float** ring_buffers; // one pointer for each ring buffer 1030 #else 1031 float* ring_buffer; // one big buffer that we index into 1032 #endif 1033 1034 float* vertical_buffer; 1035 1036 char no_cache_straddle[64]; 1037 } stbir__per_split_info; 1038 1039 typedef void stbir__decode_pixels_func( float * decode, int width_times_channels, void const * input ); 1040 typedef void stbir__alpha_weight_func( float * decode_buffer, int width_times_channels ); 1041 typedef void stbir__horizontal_gather_channels_func( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, 1042 stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width ); 1043 typedef void stbir__alpha_unweight_func(float * encode_buffer, int width_times_channels ); 1044 typedef void stbir__encode_pixels_func( void * output, int width_times_channels, float const * encode ); 1045 1046 struct stbir__info 1047 { 1048 #ifdef STBIR_PROFILE 1049 union 1050 { 1051 struct { stbir_uint64 total, build, alloc, horizontal, vertical, cleanup, pivot; } named; 1052 stbir_uint64 array[7]; 1053 } profile; 1054 stbir_uint64 * current_zone_excluded_ptr; 1055 #endif 1056 stbir__sampler horizontal; 1057 stbir__sampler vertical; 1058 1059 void const * input_data; 1060 void * output_data; 1061 1062 int input_stride_bytes; 1063 int output_stride_bytes; 1064 int ring_buffer_length_bytes; // The length of an individual entry in the ring buffer. The total number of ring buffers is stbir__get_filter_pixel_width(filter) 1065 int ring_buffer_num_entries; // Total number of entries in the ring buffer. 1066 1067 stbir_datatype input_type; 1068 stbir_datatype output_type; 1069 1070 stbir_input_callback * in_pixels_cb; 1071 void * user_data; 1072 stbir_output_callback * out_pixels_cb; 1073 1074 stbir__extents scanline_extents; 1075 1076 void * alloced_mem; 1077 stbir__per_split_info * split_info; // by default 1, but there will be N of these allocated based on the thread init you did 1078 1079 stbir__decode_pixels_func * decode_pixels; 1080 stbir__alpha_weight_func * alpha_weight; 1081 stbir__horizontal_gather_channels_func * horizontal_gather_channels; 1082 stbir__alpha_unweight_func * alpha_unweight; 1083 stbir__encode_pixels_func * encode_pixels; 1084 1085 int alloc_ring_buffer_num_entries; // Number of entries in the ring buffer that will be allocated 1086 int splits; // count of splits 1087 1088 stbir_internal_pixel_layout input_pixel_layout_internal; 1089 stbir_internal_pixel_layout output_pixel_layout_internal; 1090 1091 int input_color_and_type; 1092 int offset_x, offset_y; // offset within output_data 1093 int vertical_first; 1094 int channels; 1095 int effective_channels; // same as channels, except on RGBA/ARGB (7), or XA/AX (3) 1096 size_t alloced_total; 1097 }; 1098 1099 1100 #define stbir__max_uint8_as_float 255.0f 1101 #define stbir__max_uint16_as_float 65535.0f 1102 #define stbir__max_uint8_as_float_inverted (1.0f/255.0f) 1103 #define stbir__max_uint16_as_float_inverted (1.0f/65535.0f) 1104 #define stbir__small_float ((float)1 / (1 << 20) / (1 << 20) / (1 << 20) / (1 << 20) / (1 << 20) / (1 << 20)) 1105 1106 // min/max friendly 1107 #define STBIR_CLAMP(x, xmin, xmax) for(;;) { \ 1108 if ( (x) < (xmin) ) (x) = (xmin); \ 1109 if ( (x) > (xmax) ) (x) = (xmax); \ 1110 break; \ 1111 } 1112 1113 static stbir__inline int stbir__min(int a, int b) 1114 { 1115 return a < b ? a : b; 1116 } 1117 1118 static stbir__inline int stbir__max(int a, int b) 1119 { 1120 return a > b ? a : b; 1121 } 1122 1123 static float stbir__srgb_uchar_to_linear_float[256] = { 1124 0.000000f, 0.000304f, 0.000607f, 0.000911f, 0.001214f, 0.001518f, 0.001821f, 0.002125f, 0.002428f, 0.002732f, 0.003035f, 1125 0.003347f, 0.003677f, 0.004025f, 0.004391f, 0.004777f, 0.005182f, 0.005605f, 0.006049f, 0.006512f, 0.006995f, 0.007499f, 1126 0.008023f, 0.008568f, 0.009134f, 0.009721f, 0.010330f, 0.010960f, 0.011612f, 0.012286f, 0.012983f, 0.013702f, 0.014444f, 1127 0.015209f, 0.015996f, 0.016807f, 0.017642f, 0.018500f, 0.019382f, 0.020289f, 0.021219f, 0.022174f, 0.023153f, 0.024158f, 1128 0.025187f, 0.026241f, 0.027321f, 0.028426f, 0.029557f, 0.030713f, 0.031896f, 0.033105f, 0.034340f, 0.035601f, 0.036889f, 1129 0.038204f, 0.039546f, 0.040915f, 0.042311f, 0.043735f, 0.045186f, 0.046665f, 0.048172f, 0.049707f, 0.051269f, 0.052861f, 1130 0.054480f, 0.056128f, 0.057805f, 0.059511f, 0.061246f, 0.063010f, 0.064803f, 0.066626f, 0.068478f, 0.070360f, 0.072272f, 1131 0.074214f, 0.076185f, 0.078187f, 0.080220f, 0.082283f, 0.084376f, 0.086500f, 0.088656f, 0.090842f, 0.093059f, 0.095307f, 1132 0.097587f, 0.099899f, 0.102242f, 0.104616f, 0.107023f, 0.109462f, 0.111932f, 0.114435f, 0.116971f, 0.119538f, 0.122139f, 1133 0.124772f, 0.127438f, 0.130136f, 0.132868f, 0.135633f, 0.138432f, 0.141263f, 0.144128f, 0.147027f, 0.149960f, 0.152926f, 1134 0.155926f, 0.158961f, 0.162029f, 0.165132f, 0.168269f, 0.171441f, 0.174647f, 0.177888f, 0.181164f, 0.184475f, 0.187821f, 1135 0.191202f, 0.194618f, 0.198069f, 0.201556f, 0.205079f, 0.208637f, 0.212231f, 0.215861f, 0.219526f, 0.223228f, 0.226966f, 1136 0.230740f, 0.234551f, 0.238398f, 0.242281f, 0.246201f, 0.250158f, 0.254152f, 0.258183f, 0.262251f, 0.266356f, 0.270498f, 1137 0.274677f, 0.278894f, 0.283149f, 0.287441f, 0.291771f, 0.296138f, 0.300544f, 0.304987f, 0.309469f, 0.313989f, 0.318547f, 1138 0.323143f, 0.327778f, 0.332452f, 0.337164f, 0.341914f, 0.346704f, 0.351533f, 0.356400f, 0.361307f, 0.366253f, 0.371238f, 1139 0.376262f, 0.381326f, 0.386430f, 0.391573f, 0.396755f, 0.401978f, 0.407240f, 0.412543f, 0.417885f, 0.423268f, 0.428691f, 1140 0.434154f, 0.439657f, 0.445201f, 0.450786f, 0.456411f, 0.462077f, 0.467784f, 0.473532f, 0.479320f, 0.485150f, 0.491021f, 1141 0.496933f, 0.502887f, 0.508881f, 0.514918f, 0.520996f, 0.527115f, 0.533276f, 0.539480f, 0.545725f, 0.552011f, 0.558340f, 1142 0.564712f, 0.571125f, 0.577581f, 0.584078f, 0.590619f, 0.597202f, 0.603827f, 0.610496f, 0.617207f, 0.623960f, 0.630757f, 1143 0.637597f, 0.644480f, 0.651406f, 0.658375f, 0.665387f, 0.672443f, 0.679543f, 0.686685f, 0.693872f, 0.701102f, 0.708376f, 1144 0.715694f, 0.723055f, 0.730461f, 0.737911f, 0.745404f, 0.752942f, 0.760525f, 0.768151f, 0.775822f, 0.783538f, 0.791298f, 1145 0.799103f, 0.806952f, 0.814847f, 0.822786f, 0.830770f, 0.838799f, 0.846873f, 0.854993f, 0.863157f, 0.871367f, 0.879622f, 1146 0.887923f, 0.896269f, 0.904661f, 0.913099f, 0.921582f, 0.930111f, 0.938686f, 0.947307f, 0.955974f, 0.964686f, 0.973445f, 1147 0.982251f, 0.991102f, 1.0f 1148 }; 1149 1150 typedef union 1151 { 1152 unsigned int u; 1153 float f; 1154 } stbir__FP32; 1155 1156 // From https://gist.github.com/rygorous/2203834 1157 1158 static const stbir_uint32 fp32_to_srgb8_tab4[104] = { 1159 0x0073000d, 0x007a000d, 0x0080000d, 0x0087000d, 0x008d000d, 0x0094000d, 0x009a000d, 0x00a1000d, 1160 0x00a7001a, 0x00b4001a, 0x00c1001a, 0x00ce001a, 0x00da001a, 0x00e7001a, 0x00f4001a, 0x0101001a, 1161 0x010e0033, 0x01280033, 0x01410033, 0x015b0033, 0x01750033, 0x018f0033, 0x01a80033, 0x01c20033, 1162 0x01dc0067, 0x020f0067, 0x02430067, 0x02760067, 0x02aa0067, 0x02dd0067, 0x03110067, 0x03440067, 1163 0x037800ce, 0x03df00ce, 0x044600ce, 0x04ad00ce, 0x051400ce, 0x057b00c5, 0x05dd00bc, 0x063b00b5, 1164 0x06970158, 0x07420142, 0x07e30130, 0x087b0120, 0x090b0112, 0x09940106, 0x0a1700fc, 0x0a9500f2, 1165 0x0b0f01cb, 0x0bf401ae, 0x0ccb0195, 0x0d950180, 0x0e56016e, 0x0f0d015e, 0x0fbc0150, 0x10630143, 1166 0x11070264, 0x1238023e, 0x1357021d, 0x14660201, 0x156601e9, 0x165a01d3, 0x174401c0, 0x182401af, 1167 0x18fe0331, 0x1a9602fe, 0x1c1502d2, 0x1d7e02ad, 0x1ed4028d, 0x201a0270, 0x21520256, 0x227d0240, 1168 0x239f0443, 0x25c003fe, 0x27bf03c4, 0x29a10392, 0x2b6a0367, 0x2d1d0341, 0x2ebe031f, 0x304d0300, 1169 0x31d105b0, 0x34a80555, 0x37520507, 0x39d504c5, 0x3c37048b, 0x3e7c0458, 0x40a8042a, 0x42bd0401, 1170 0x44c20798, 0x488e071e, 0x4c1c06b6, 0x4f76065d, 0x52a50610, 0x55ac05cc, 0x5892058f, 0x5b590559, 1171 0x5e0c0a23, 0x631c0980, 0x67db08f6, 0x6c55087f, 0x70940818, 0x74a007bd, 0x787d076c, 0x7c330723, 1172 }; 1173 1174 static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in) 1175 { 1176 static const stbir__FP32 almostone = { 0x3f7fffff }; // 1-eps 1177 static const stbir__FP32 minval = { (127-13) << 23 }; 1178 stbir_uint32 tab,bias,scale,t; 1179 stbir__FP32 f; 1180 1181 // Clamp to [2^(-13), 1-eps]; these two values map to 0 and 1, respectively. 1182 // The tests are carefully written so that NaNs map to 0, same as in the reference 1183 // implementation. 1184 if (!(in > minval.f)) // written this way to catch NaNs 1185 return 0; 1186 if (in > almostone.f) 1187 return 255; 1188 1189 // Do the table lookup and unpack bias, scale 1190 f.f = in; 1191 tab = fp32_to_srgb8_tab4[(f.u - minval.u) >> 20]; 1192 bias = (tab >> 16) << 9; 1193 scale = tab & 0xffff; 1194 1195 // Grab next-highest mantissa bits and perform linear interpolation 1196 t = (f.u >> 12) & 0xff; 1197 return (unsigned char) ((bias + scale*t) >> 16); 1198 } 1199 1200 #ifndef STBIR_FORCE_GATHER_FILTER_SCANLINES_AMOUNT 1201 #define STBIR_FORCE_GATHER_FILTER_SCANLINES_AMOUNT 32 // when downsampling and <= 32 scanlines of buffering, use gather. gather used down to 1/8th scaling for 25% win. 1202 #endif 1203 1204 #ifndef STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS 1205 #define STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS 4 // when threading, what is the minimum number of scanlines for a split? 1206 #endif 1207 1208 // restrict pointers for the output pointers, other loop and unroll control 1209 #if defined( _MSC_VER ) && !defined(__clang__) 1210 #define STBIR_STREAMOUT_PTR( star ) star __restrict 1211 #define STBIR_NO_UNROLL( ptr ) __assume(ptr) // this oddly keeps msvc from unrolling a loop 1212 #if _MSC_VER >= 1900 1213 #define STBIR_NO_UNROLL_LOOP_START __pragma(loop( no_vector )) 1214 #else 1215 #define STBIR_NO_UNROLL_LOOP_START 1216 #endif 1217 #elif defined( __clang__ ) 1218 #define STBIR_STREAMOUT_PTR( star ) star __restrict__ 1219 #define STBIR_NO_UNROLL( ptr ) __asm__ (""::"r"(ptr)) 1220 #if ( __clang_major__ >= 4 ) || ( ( __clang_major__ >= 3 ) && ( __clang_minor__ >= 5 ) ) 1221 #define STBIR_NO_UNROLL_LOOP_START _Pragma("clang loop unroll(disable)") _Pragma("clang loop vectorize(disable)") 1222 #else 1223 #define STBIR_NO_UNROLL_LOOP_START 1224 #endif 1225 #elif defined( __GNUC__ ) 1226 #define STBIR_STREAMOUT_PTR( star ) star __restrict__ 1227 #define STBIR_NO_UNROLL( ptr ) __asm__ (""::"r"(ptr)) 1228 #if __GNUC__ >= 14 1229 #define STBIR_NO_UNROLL_LOOP_START _Pragma("GCC unroll 0") _Pragma("GCC novector") 1230 #else 1231 #define STBIR_NO_UNROLL_LOOP_START 1232 #endif 1233 #define STBIR_NO_UNROLL_LOOP_START_INF_FOR 1234 #else 1235 #define STBIR_STREAMOUT_PTR( star ) star 1236 #define STBIR_NO_UNROLL( ptr ) 1237 #define STBIR_NO_UNROLL_LOOP_START 1238 #endif 1239 1240 #ifndef STBIR_NO_UNROLL_LOOP_START_INF_FOR 1241 #define STBIR_NO_UNROLL_LOOP_START_INF_FOR STBIR_NO_UNROLL_LOOP_START 1242 #endif 1243 1244 #ifdef STBIR_NO_SIMD // force simd off for whatever reason 1245 1246 // force simd off overrides everything else, so clear it all 1247 1248 #ifdef STBIR_SSE2 1249 #undef STBIR_SSE2 1250 #endif 1251 1252 #ifdef STBIR_AVX 1253 #undef STBIR_AVX 1254 #endif 1255 1256 #ifdef STBIR_NEON 1257 #undef STBIR_NEON 1258 #endif 1259 1260 #ifdef STBIR_AVX2 1261 #undef STBIR_AVX2 1262 #endif 1263 1264 #ifdef STBIR_FP16C 1265 #undef STBIR_FP16C 1266 #endif 1267 1268 #ifdef STBIR_WASM 1269 #undef STBIR_WASM 1270 #endif 1271 1272 #ifdef STBIR_SIMD 1273 #undef STBIR_SIMD 1274 #endif 1275 1276 #else // STBIR_SIMD 1277 1278 #ifdef STBIR_SSE2 1279 #include <emmintrin.h> 1280 1281 #define stbir__simdf __m128 1282 #define stbir__simdi __m128i 1283 1284 #define stbir_simdi_castf( reg ) _mm_castps_si128(reg) 1285 #define stbir_simdf_casti( reg ) _mm_castsi128_ps(reg) 1286 1287 #define stbir__simdf_load( reg, ptr ) (reg) = _mm_loadu_ps( (float const*)(ptr) ) 1288 #define stbir__simdi_load( reg, ptr ) (reg) = _mm_loadu_si128 ( (stbir__simdi const*)(ptr) ) 1289 #define stbir__simdf_load1( out, ptr ) (out) = _mm_load_ss( (float const*)(ptr) ) // top values can be random (not denormal or nan for perf) 1290 #define stbir__simdi_load1( out, ptr ) (out) = _mm_castps_si128( _mm_load_ss( (float const*)(ptr) )) 1291 #define stbir__simdf_load1z( out, ptr ) (out) = _mm_load_ss( (float const*)(ptr) ) // top values must be zero 1292 #define stbir__simdf_frep4( fvar ) _mm_set_ps1( fvar ) 1293 #define stbir__simdf_load1frep4( out, fvar ) (out) = _mm_set_ps1( fvar ) 1294 #define stbir__simdf_load2( out, ptr ) (out) = _mm_castsi128_ps( _mm_loadl_epi64( (__m128i*)(ptr)) ) // top values can be random (not denormal or nan for perf) 1295 #define stbir__simdf_load2z( out, ptr ) (out) = _mm_castsi128_ps( _mm_loadl_epi64( (__m128i*)(ptr)) ) // top values must be zero 1296 #define stbir__simdf_load2hmerge( out, reg, ptr ) (out) = _mm_castpd_ps(_mm_loadh_pd( _mm_castps_pd(reg), (double*)(ptr) )) 1297 1298 #define stbir__simdf_zeroP() _mm_setzero_ps() 1299 #define stbir__simdf_zero( reg ) (reg) = _mm_setzero_ps() 1300 1301 #define stbir__simdf_store( ptr, reg ) _mm_storeu_ps( (float*)(ptr), reg ) 1302 #define stbir__simdf_store1( ptr, reg ) _mm_store_ss( (float*)(ptr), reg ) 1303 #define stbir__simdf_store2( ptr, reg ) _mm_storel_epi64( (__m128i*)(ptr), _mm_castps_si128(reg) ) 1304 #define stbir__simdf_store2h( ptr, reg ) _mm_storeh_pd( (double*)(ptr), _mm_castps_pd(reg) ) 1305 1306 #define stbir__simdi_store( ptr, reg ) _mm_storeu_si128( (__m128i*)(ptr), reg ) 1307 #define stbir__simdi_store1( ptr, reg ) _mm_store_ss( (float*)(ptr), _mm_castsi128_ps(reg) ) 1308 #define stbir__simdi_store2( ptr, reg ) _mm_storel_epi64( (__m128i*)(ptr), (reg) ) 1309 1310 #define stbir__prefetch( ptr ) _mm_prefetch((char*)(ptr), _MM_HINT_T0 ) 1311 1312 #define stbir__simdi_expand_u8_to_u32(out0,out1,out2,out3,ireg) \ 1313 { \ 1314 stbir__simdi zero = _mm_setzero_si128(); \ 1315 out2 = _mm_unpacklo_epi8( ireg, zero ); \ 1316 out3 = _mm_unpackhi_epi8( ireg, zero ); \ 1317 out0 = _mm_unpacklo_epi16( out2, zero ); \ 1318 out1 = _mm_unpackhi_epi16( out2, zero ); \ 1319 out2 = _mm_unpacklo_epi16( out3, zero ); \ 1320 out3 = _mm_unpackhi_epi16( out3, zero ); \ 1321 } 1322 1323 #define stbir__simdi_expand_u8_to_1u32(out,ireg) \ 1324 { \ 1325 stbir__simdi zero = _mm_setzero_si128(); \ 1326 out = _mm_unpacklo_epi8( ireg, zero ); \ 1327 out = _mm_unpacklo_epi16( out, zero ); \ 1328 } 1329 1330 #define stbir__simdi_expand_u16_to_u32(out0,out1,ireg) \ 1331 { \ 1332 stbir__simdi zero = _mm_setzero_si128(); \ 1333 out0 = _mm_unpacklo_epi16( ireg, zero ); \ 1334 out1 = _mm_unpackhi_epi16( ireg, zero ); \ 1335 } 1336 1337 #define stbir__simdf_convert_float_to_i32( i, f ) (i) = _mm_cvttps_epi32(f) 1338 #define stbir__simdf_convert_float_to_int( f ) _mm_cvtt_ss2si(f) 1339 #define stbir__simdf_convert_float_to_uint8( f ) ((unsigned char)_mm_cvtsi128_si32(_mm_cvttps_epi32(_mm_max_ps(_mm_min_ps(f,STBIR__CONSTF(STBIR_max_uint8_as_float)),_mm_setzero_ps())))) 1340 #define stbir__simdf_convert_float_to_short( f ) ((unsigned short)_mm_cvtsi128_si32(_mm_cvttps_epi32(_mm_max_ps(_mm_min_ps(f,STBIR__CONSTF(STBIR_max_uint16_as_float)),_mm_setzero_ps())))) 1341 1342 #define stbir__simdi_to_int( i ) _mm_cvtsi128_si32(i) 1343 #define stbir__simdi_convert_i32_to_float(out, ireg) (out) = _mm_cvtepi32_ps( ireg ) 1344 #define stbir__simdf_add( out, reg0, reg1 ) (out) = _mm_add_ps( reg0, reg1 ) 1345 #define stbir__simdf_mult( out, reg0, reg1 ) (out) = _mm_mul_ps( reg0, reg1 ) 1346 #define stbir__simdf_mult_mem( out, reg, ptr ) (out) = _mm_mul_ps( reg, _mm_loadu_ps( (float const*)(ptr) ) ) 1347 #define stbir__simdf_mult1_mem( out, reg, ptr ) (out) = _mm_mul_ss( reg, _mm_load_ss( (float const*)(ptr) ) ) 1348 #define stbir__simdf_add_mem( out, reg, ptr ) (out) = _mm_add_ps( reg, _mm_loadu_ps( (float const*)(ptr) ) ) 1349 #define stbir__simdf_add1_mem( out, reg, ptr ) (out) = _mm_add_ss( reg, _mm_load_ss( (float const*)(ptr) ) ) 1350 1351 #ifdef STBIR_USE_FMA // not on by default to maintain bit identical simd to non-simd 1352 #include <immintrin.h> 1353 #define stbir__simdf_madd( out, add, mul1, mul2 ) (out) = _mm_fmadd_ps( mul1, mul2, add ) 1354 #define stbir__simdf_madd1( out, add, mul1, mul2 ) (out) = _mm_fmadd_ss( mul1, mul2, add ) 1355 #define stbir__simdf_madd_mem( out, add, mul, ptr ) (out) = _mm_fmadd_ps( mul, _mm_loadu_ps( (float const*)(ptr) ), add ) 1356 #define stbir__simdf_madd1_mem( out, add, mul, ptr ) (out) = _mm_fmadd_ss( mul, _mm_load_ss( (float const*)(ptr) ), add ) 1357 #else 1358 #define stbir__simdf_madd( out, add, mul1, mul2 ) (out) = _mm_add_ps( add, _mm_mul_ps( mul1, mul2 ) ) 1359 #define stbir__simdf_madd1( out, add, mul1, mul2 ) (out) = _mm_add_ss( add, _mm_mul_ss( mul1, mul2 ) ) 1360 #define stbir__simdf_madd_mem( out, add, mul, ptr ) (out) = _mm_add_ps( add, _mm_mul_ps( mul, _mm_loadu_ps( (float const*)(ptr) ) ) ) 1361 #define stbir__simdf_madd1_mem( out, add, mul, ptr ) (out) = _mm_add_ss( add, _mm_mul_ss( mul, _mm_load_ss( (float const*)(ptr) ) ) ) 1362 #endif 1363 1364 #define stbir__simdf_add1( out, reg0, reg1 ) (out) = _mm_add_ss( reg0, reg1 ) 1365 #define stbir__simdf_mult1( out, reg0, reg1 ) (out) = _mm_mul_ss( reg0, reg1 ) 1366 1367 #define stbir__simdf_and( out, reg0, reg1 ) (out) = _mm_and_ps( reg0, reg1 ) 1368 #define stbir__simdf_or( out, reg0, reg1 ) (out) = _mm_or_ps( reg0, reg1 ) 1369 1370 #define stbir__simdf_min( out, reg0, reg1 ) (out) = _mm_min_ps( reg0, reg1 ) 1371 #define stbir__simdf_max( out, reg0, reg1 ) (out) = _mm_max_ps( reg0, reg1 ) 1372 #define stbir__simdf_min1( out, reg0, reg1 ) (out) = _mm_min_ss( reg0, reg1 ) 1373 #define stbir__simdf_max1( out, reg0, reg1 ) (out) = _mm_max_ss( reg0, reg1 ) 1374 1375 #define stbir__simdf_0123ABCDto3ABx( out, reg0, reg1 ) (out)=_mm_castsi128_ps( _mm_shuffle_epi32( _mm_castps_si128( _mm_shuffle_ps( reg1,reg0, (0<<0) + (1<<2) + (2<<4) + (3<<6) )), (3<<0) + (0<<2) + (1<<4) + (2<<6) ) ) 1376 #define stbir__simdf_0123ABCDto23Ax( out, reg0, reg1 ) (out)=_mm_castsi128_ps( _mm_shuffle_epi32( _mm_castps_si128( _mm_shuffle_ps( reg1,reg0, (0<<0) + (1<<2) + (2<<4) + (3<<6) )), (2<<0) + (3<<2) + (0<<4) + (1<<6) ) ) 1377 1378 static const stbir__simdf STBIR_zeroones = { 0.0f,1.0f,0.0f,1.0f }; 1379 static const stbir__simdf STBIR_onezeros = { 1.0f,0.0f,1.0f,0.0f }; 1380 #define stbir__simdf_aaa1( out, alp, ones ) (out)=_mm_castsi128_ps( _mm_shuffle_epi32( _mm_castps_si128( _mm_movehl_ps( ones, alp ) ), (1<<0) + (1<<2) + (1<<4) + (2<<6) ) ) 1381 #define stbir__simdf_1aaa( out, alp, ones ) (out)=_mm_castsi128_ps( _mm_shuffle_epi32( _mm_castps_si128( _mm_movelh_ps( ones, alp ) ), (0<<0) + (2<<2) + (2<<4) + (2<<6) ) ) 1382 #define stbir__simdf_a1a1( out, alp, ones) (out) = _mm_or_ps( _mm_castsi128_ps( _mm_srli_epi64( _mm_castps_si128(alp), 32 ) ), STBIR_zeroones ) 1383 #define stbir__simdf_1a1a( out, alp, ones) (out) = _mm_or_ps( _mm_castsi128_ps( _mm_slli_epi64( _mm_castps_si128(alp), 32 ) ), STBIR_onezeros ) 1384 1385 #define stbir__simdf_swiz( reg, one, two, three, four ) _mm_castsi128_ps( _mm_shuffle_epi32( _mm_castps_si128( reg ), (one<<0) + (two<<2) + (three<<4) + (four<<6) ) ) 1386 1387 #define stbir__simdi_and( out, reg0, reg1 ) (out) = _mm_and_si128( reg0, reg1 ) 1388 #define stbir__simdi_or( out, reg0, reg1 ) (out) = _mm_or_si128( reg0, reg1 ) 1389 #define stbir__simdi_16madd( out, reg0, reg1 ) (out) = _mm_madd_epi16( reg0, reg1 ) 1390 1391 #define stbir__simdf_pack_to_8bytes(out,aa,bb) \ 1392 { \ 1393 stbir__simdf af,bf; \ 1394 stbir__simdi a,b; \ 1395 af = _mm_min_ps( aa, STBIR_max_uint8_as_float ); \ 1396 bf = _mm_min_ps( bb, STBIR_max_uint8_as_float ); \ 1397 af = _mm_max_ps( af, _mm_setzero_ps() ); \ 1398 bf = _mm_max_ps( bf, _mm_setzero_ps() ); \ 1399 a = _mm_cvttps_epi32( af ); \ 1400 b = _mm_cvttps_epi32( bf ); \ 1401 a = _mm_packs_epi32( a, b ); \ 1402 out = _mm_packus_epi16( a, a ); \ 1403 } 1404 1405 #define stbir__simdf_load4_transposed( o0, o1, o2, o3, ptr ) \ 1406 stbir__simdf_load( o0, (ptr) ); \ 1407 stbir__simdf_load( o1, (ptr)+4 ); \ 1408 stbir__simdf_load( o2, (ptr)+8 ); \ 1409 stbir__simdf_load( o3, (ptr)+12 ); \ 1410 { \ 1411 __m128 tmp0, tmp1, tmp2, tmp3; \ 1412 tmp0 = _mm_unpacklo_ps(o0, o1); \ 1413 tmp2 = _mm_unpacklo_ps(o2, o3); \ 1414 tmp1 = _mm_unpackhi_ps(o0, o1); \ 1415 tmp3 = _mm_unpackhi_ps(o2, o3); \ 1416 o0 = _mm_movelh_ps(tmp0, tmp2); \ 1417 o1 = _mm_movehl_ps(tmp2, tmp0); \ 1418 o2 = _mm_movelh_ps(tmp1, tmp3); \ 1419 o3 = _mm_movehl_ps(tmp3, tmp1); \ 1420 } 1421 1422 #define stbir__interleave_pack_and_store_16_u8( ptr, r0, r1, r2, r3 ) \ 1423 r0 = _mm_packs_epi32( r0, r1 ); \ 1424 r2 = _mm_packs_epi32( r2, r3 ); \ 1425 r1 = _mm_unpacklo_epi16( r0, r2 ); \ 1426 r3 = _mm_unpackhi_epi16( r0, r2 ); \ 1427 r0 = _mm_unpacklo_epi16( r1, r3 ); \ 1428 r2 = _mm_unpackhi_epi16( r1, r3 ); \ 1429 r0 = _mm_packus_epi16( r0, r2 ); \ 1430 stbir__simdi_store( ptr, r0 ); \ 1431 1432 #define stbir__simdi_32shr( out, reg, imm ) out = _mm_srli_epi32( reg, imm ) 1433 1434 #if defined(_MSC_VER) && !defined(__clang__) 1435 // msvc inits with 8 bytes 1436 #define STBIR__CONST_32_TO_8( v ) (char)(unsigned char)((v)&255),(char)(unsigned char)(((v)>>8)&255),(char)(unsigned char)(((v)>>16)&255),(char)(unsigned char)(((v)>>24)&255) 1437 #define STBIR__CONST_4_32i( v ) STBIR__CONST_32_TO_8( v ), STBIR__CONST_32_TO_8( v ), STBIR__CONST_32_TO_8( v ), STBIR__CONST_32_TO_8( v ) 1438 #define STBIR__CONST_4d_32i( v0, v1, v2, v3 ) STBIR__CONST_32_TO_8( v0 ), STBIR__CONST_32_TO_8( v1 ), STBIR__CONST_32_TO_8( v2 ), STBIR__CONST_32_TO_8( v3 ) 1439 #else 1440 // everything else inits with long long's 1441 #define STBIR__CONST_4_32i( v ) (long long)((((stbir_uint64)(stbir_uint32)(v))<<32)|((stbir_uint64)(stbir_uint32)(v))),(long long)((((stbir_uint64)(stbir_uint32)(v))<<32)|((stbir_uint64)(stbir_uint32)(v))) 1442 #define STBIR__CONST_4d_32i( v0, v1, v2, v3 ) (long long)((((stbir_uint64)(stbir_uint32)(v1))<<32)|((stbir_uint64)(stbir_uint32)(v0))),(long long)((((stbir_uint64)(stbir_uint32)(v3))<<32)|((stbir_uint64)(stbir_uint32)(v2))) 1443 #endif 1444 1445 #define STBIR__SIMDF_CONST(var, x) stbir__simdf var = { x, x, x, x } 1446 #define STBIR__SIMDI_CONST(var, x) stbir__simdi var = { STBIR__CONST_4_32i(x) } 1447 #define STBIR__CONSTF(var) (var) 1448 #define STBIR__CONSTI(var) (var) 1449 1450 #if defined(STBIR_AVX) || defined(__SSE4_1__) 1451 #include <smmintrin.h> 1452 #define stbir__simdf_pack_to_8words(out,reg0,reg1) out = _mm_packus_epi32(_mm_cvttps_epi32(_mm_max_ps(_mm_min_ps(reg0,STBIR__CONSTF(STBIR_max_uint16_as_float)),_mm_setzero_ps())), _mm_cvttps_epi32(_mm_max_ps(_mm_min_ps(reg1,STBIR__CONSTF(STBIR_max_uint16_as_float)),_mm_setzero_ps()))) 1453 #else 1454 STBIR__SIMDI_CONST(stbir__s32_32768, 32768); 1455 STBIR__SIMDI_CONST(stbir__s16_32768, ((32768<<16)|32768)); 1456 1457 #define stbir__simdf_pack_to_8words(out,reg0,reg1) \ 1458 { \ 1459 stbir__simdi tmp0,tmp1; \ 1460 tmp0 = _mm_cvttps_epi32(_mm_max_ps(_mm_min_ps(reg0,STBIR__CONSTF(STBIR_max_uint16_as_float)),_mm_setzero_ps())); \ 1461 tmp1 = _mm_cvttps_epi32(_mm_max_ps(_mm_min_ps(reg1,STBIR__CONSTF(STBIR_max_uint16_as_float)),_mm_setzero_ps())); \ 1462 tmp0 = _mm_sub_epi32( tmp0, stbir__s32_32768 ); \ 1463 tmp1 = _mm_sub_epi32( tmp1, stbir__s32_32768 ); \ 1464 out = _mm_packs_epi32( tmp0, tmp1 ); \ 1465 out = _mm_sub_epi16( out, stbir__s16_32768 ); \ 1466 } 1467 1468 #endif 1469 1470 #define STBIR_SIMD 1471 1472 // if we detect AVX, set the simd8 defines 1473 #ifdef STBIR_AVX 1474 #include <immintrin.h> 1475 #define STBIR_SIMD8 1476 #define stbir__simdf8 __m256 1477 #define stbir__simdi8 __m256i 1478 #define stbir__simdf8_load( out, ptr ) (out) = _mm256_loadu_ps( (float const *)(ptr) ) 1479 #define stbir__simdi8_load( out, ptr ) (out) = _mm256_loadu_si256( (__m256i const *)(ptr) ) 1480 #define stbir__simdf8_mult( out, a, b ) (out) = _mm256_mul_ps( (a), (b) ) 1481 #define stbir__simdf8_store( ptr, out ) _mm256_storeu_ps( (float*)(ptr), out ) 1482 #define stbir__simdi8_store( ptr, reg ) _mm256_storeu_si256( (__m256i*)(ptr), reg ) 1483 #define stbir__simdf8_frep8( fval ) _mm256_set1_ps( fval ) 1484 1485 #define stbir__simdf8_min( out, reg0, reg1 ) (out) = _mm256_min_ps( reg0, reg1 ) 1486 #define stbir__simdf8_max( out, reg0, reg1 ) (out) = _mm256_max_ps( reg0, reg1 ) 1487 1488 #define stbir__simdf8_add4halves( out, bot4, top8 ) (out) = _mm_add_ps( bot4, _mm256_extractf128_ps( top8, 1 ) ) 1489 #define stbir__simdf8_mult_mem( out, reg, ptr ) (out) = _mm256_mul_ps( reg, _mm256_loadu_ps( (float const*)(ptr) ) ) 1490 #define stbir__simdf8_add_mem( out, reg, ptr ) (out) = _mm256_add_ps( reg, _mm256_loadu_ps( (float const*)(ptr) ) ) 1491 #define stbir__simdf8_add( out, a, b ) (out) = _mm256_add_ps( a, b ) 1492 #define stbir__simdf8_load1b( out, ptr ) (out) = _mm256_broadcast_ss( ptr ) 1493 #define stbir__simdf_load1rep4( out, ptr ) (out) = _mm_broadcast_ss( ptr ) // avx load instruction 1494 1495 #define stbir__simdi8_convert_i32_to_float(out, ireg) (out) = _mm256_cvtepi32_ps( ireg ) 1496 #define stbir__simdf8_convert_float_to_i32( i, f ) (i) = _mm256_cvttps_epi32(f) 1497 1498 #define stbir__simdf8_bot4s( out, a, b ) (out) = _mm256_permute2f128_ps(a,b, (0<<0)+(2<<4) ) 1499 #define stbir__simdf8_top4s( out, a, b ) (out) = _mm256_permute2f128_ps(a,b, (1<<0)+(3<<4) ) 1500 1501 #define stbir__simdf8_gettop4( reg ) _mm256_extractf128_ps(reg,1) 1502 1503 #ifdef STBIR_AVX2 1504 1505 #define stbir__simdi8_expand_u8_to_u32(out0,out1,ireg) \ 1506 { \ 1507 stbir__simdi8 a, zero =_mm256_setzero_si256();\ 1508 a = _mm256_permute4x64_epi64( _mm256_unpacklo_epi8( _mm256_permute4x64_epi64(_mm256_castsi128_si256(ireg),(0<<0)+(2<<2)+(1<<4)+(3<<6)), zero ),(0<<0)+(2<<2)+(1<<4)+(3<<6)); \ 1509 out0 = _mm256_unpacklo_epi16( a, zero ); \ 1510 out1 = _mm256_unpackhi_epi16( a, zero ); \ 1511 } 1512 1513 #define stbir__simdf8_pack_to_16bytes(out,aa,bb) \ 1514 { \ 1515 stbir__simdi8 t; \ 1516 stbir__simdf8 af,bf; \ 1517 stbir__simdi8 a,b; \ 1518 af = _mm256_min_ps( aa, STBIR_max_uint8_as_floatX ); \ 1519 bf = _mm256_min_ps( bb, STBIR_max_uint8_as_floatX ); \ 1520 af = _mm256_max_ps( af, _mm256_setzero_ps() ); \ 1521 bf = _mm256_max_ps( bf, _mm256_setzero_ps() ); \ 1522 a = _mm256_cvttps_epi32( af ); \ 1523 b = _mm256_cvttps_epi32( bf ); \ 1524 t = _mm256_permute4x64_epi64( _mm256_packs_epi32( a, b ), (0<<0)+(2<<2)+(1<<4)+(3<<6) ); \ 1525 out = _mm256_castsi256_si128( _mm256_permute4x64_epi64( _mm256_packus_epi16( t, t ), (0<<0)+(2<<2)+(1<<4)+(3<<6) ) ); \ 1526 } 1527 1528 #define stbir__simdi8_expand_u16_to_u32(out,ireg) out = _mm256_unpacklo_epi16( _mm256_permute4x64_epi64(_mm256_castsi128_si256(ireg),(0<<0)+(2<<2)+(1<<4)+(3<<6)), _mm256_setzero_si256() ); 1529 1530 #define stbir__simdf8_pack_to_16words(out,aa,bb) \ 1531 { \ 1532 stbir__simdf8 af,bf; \ 1533 stbir__simdi8 a,b; \ 1534 af = _mm256_min_ps( aa, STBIR_max_uint16_as_floatX ); \ 1535 bf = _mm256_min_ps( bb, STBIR_max_uint16_as_floatX ); \ 1536 af = _mm256_max_ps( af, _mm256_setzero_ps() ); \ 1537 bf = _mm256_max_ps( bf, _mm256_setzero_ps() ); \ 1538 a = _mm256_cvttps_epi32( af ); \ 1539 b = _mm256_cvttps_epi32( bf ); \ 1540 (out) = _mm256_permute4x64_epi64( _mm256_packus_epi32(a, b), (0<<0)+(2<<2)+(1<<4)+(3<<6) ); \ 1541 } 1542 1543 #else 1544 1545 #define stbir__simdi8_expand_u8_to_u32(out0,out1,ireg) \ 1546 { \ 1547 stbir__simdi a,zero = _mm_setzero_si128(); \ 1548 a = _mm_unpacklo_epi8( ireg, zero ); \ 1549 out0 = _mm256_setr_m128i( _mm_unpacklo_epi16( a, zero ), _mm_unpackhi_epi16( a, zero ) ); \ 1550 a = _mm_unpackhi_epi8( ireg, zero ); \ 1551 out1 = _mm256_setr_m128i( _mm_unpacklo_epi16( a, zero ), _mm_unpackhi_epi16( a, zero ) ); \ 1552 } 1553 1554 #define stbir__simdf8_pack_to_16bytes(out,aa,bb) \ 1555 { \ 1556 stbir__simdi t; \ 1557 stbir__simdf8 af,bf; \ 1558 stbir__simdi8 a,b; \ 1559 af = _mm256_min_ps( aa, STBIR_max_uint8_as_floatX ); \ 1560 bf = _mm256_min_ps( bb, STBIR_max_uint8_as_floatX ); \ 1561 af = _mm256_max_ps( af, _mm256_setzero_ps() ); \ 1562 bf = _mm256_max_ps( bf, _mm256_setzero_ps() ); \ 1563 a = _mm256_cvttps_epi32( af ); \ 1564 b = _mm256_cvttps_epi32( bf ); \ 1565 out = _mm_packs_epi32( _mm256_castsi256_si128(a), _mm256_extractf128_si256( a, 1 ) ); \ 1566 out = _mm_packus_epi16( out, out ); \ 1567 t = _mm_packs_epi32( _mm256_castsi256_si128(b), _mm256_extractf128_si256( b, 1 ) ); \ 1568 t = _mm_packus_epi16( t, t ); \ 1569 out = _mm_castps_si128( _mm_shuffle_ps( _mm_castsi128_ps(out), _mm_castsi128_ps(t), (0<<0)+(1<<2)+(0<<4)+(1<<6) ) ); \ 1570 } 1571 1572 #define stbir__simdi8_expand_u16_to_u32(out,ireg) \ 1573 { \ 1574 stbir__simdi a,b,zero = _mm_setzero_si128(); \ 1575 a = _mm_unpacklo_epi16( ireg, zero ); \ 1576 b = _mm_unpackhi_epi16( ireg, zero ); \ 1577 out = _mm256_insertf128_si256( _mm256_castsi128_si256( a ), b, 1 ); \ 1578 } 1579 1580 #define stbir__simdf8_pack_to_16words(out,aa,bb) \ 1581 { \ 1582 stbir__simdi t0,t1; \ 1583 stbir__simdf8 af,bf; \ 1584 stbir__simdi8 a,b; \ 1585 af = _mm256_min_ps( aa, STBIR_max_uint16_as_floatX ); \ 1586 bf = _mm256_min_ps( bb, STBIR_max_uint16_as_floatX ); \ 1587 af = _mm256_max_ps( af, _mm256_setzero_ps() ); \ 1588 bf = _mm256_max_ps( bf, _mm256_setzero_ps() ); \ 1589 a = _mm256_cvttps_epi32( af ); \ 1590 b = _mm256_cvttps_epi32( bf ); \ 1591 t0 = _mm_packus_epi32( _mm256_castsi256_si128(a), _mm256_extractf128_si256( a, 1 ) ); \ 1592 t1 = _mm_packus_epi32( _mm256_castsi256_si128(b), _mm256_extractf128_si256( b, 1 ) ); \ 1593 out = _mm256_setr_m128i( t0, t1 ); \ 1594 } 1595 1596 #endif 1597 1598 static __m256i stbir_00001111 = { STBIR__CONST_4d_32i( 0, 0, 0, 0 ), STBIR__CONST_4d_32i( 1, 1, 1, 1 ) }; 1599 #define stbir__simdf8_0123to00001111( out, in ) (out) = _mm256_permutevar_ps ( in, stbir_00001111 ) 1600 1601 static __m256i stbir_22223333 = { STBIR__CONST_4d_32i( 2, 2, 2, 2 ), STBIR__CONST_4d_32i( 3, 3, 3, 3 ) }; 1602 #define stbir__simdf8_0123to22223333( out, in ) (out) = _mm256_permutevar_ps ( in, stbir_22223333 ) 1603 1604 #define stbir__simdf8_0123to2222( out, in ) (out) = stbir__simdf_swiz(_mm256_castps256_ps128(in), 2,2,2,2 ) 1605 1606 #define stbir__simdf8_load4b( out, ptr ) (out) = _mm256_broadcast_ps( (__m128 const *)(ptr) ) 1607 1608 static __m256i stbir_00112233 = { STBIR__CONST_4d_32i( 0, 0, 1, 1 ), STBIR__CONST_4d_32i( 2, 2, 3, 3 ) }; 1609 #define stbir__simdf8_0123to00112233( out, in ) (out) = _mm256_permutevar_ps ( in, stbir_00112233 ) 1610 #define stbir__simdf8_add4( out, a8, b ) (out) = _mm256_add_ps( a8, _mm256_castps128_ps256( b ) ) 1611 1612 static __m256i stbir_load6 = { STBIR__CONST_4_32i( 0x80000000 ), STBIR__CONST_4d_32i( 0x80000000, 0x80000000, 0, 0 ) }; 1613 #define stbir__simdf8_load6z( out, ptr ) (out) = _mm256_maskload_ps( ptr, stbir_load6 ) 1614 1615 #define stbir__simdf8_0123to00000000( out, in ) (out) = _mm256_shuffle_ps ( in, in, (0<<0)+(0<<2)+(0<<4)+(0<<6) ) 1616 #define stbir__simdf8_0123to11111111( out, in ) (out) = _mm256_shuffle_ps ( in, in, (1<<0)+(1<<2)+(1<<4)+(1<<6) ) 1617 #define stbir__simdf8_0123to22222222( out, in ) (out) = _mm256_shuffle_ps ( in, in, (2<<0)+(2<<2)+(2<<4)+(2<<6) ) 1618 #define stbir__simdf8_0123to33333333( out, in ) (out) = _mm256_shuffle_ps ( in, in, (3<<0)+(3<<2)+(3<<4)+(3<<6) ) 1619 #define stbir__simdf8_0123to21032103( out, in ) (out) = _mm256_shuffle_ps ( in, in, (2<<0)+(1<<2)+(0<<4)+(3<<6) ) 1620 #define stbir__simdf8_0123to32103210( out, in ) (out) = _mm256_shuffle_ps ( in, in, (3<<0)+(2<<2)+(1<<4)+(0<<6) ) 1621 #define stbir__simdf8_0123to12301230( out, in ) (out) = _mm256_shuffle_ps ( in, in, (1<<0)+(2<<2)+(3<<4)+(0<<6) ) 1622 #define stbir__simdf8_0123to10321032( out, in ) (out) = _mm256_shuffle_ps ( in, in, (1<<0)+(0<<2)+(3<<4)+(2<<6) ) 1623 #define stbir__simdf8_0123to30123012( out, in ) (out) = _mm256_shuffle_ps ( in, in, (3<<0)+(0<<2)+(1<<4)+(2<<6) ) 1624 1625 #define stbir__simdf8_0123to11331133( out, in ) (out) = _mm256_shuffle_ps ( in, in, (1<<0)+(1<<2)+(3<<4)+(3<<6) ) 1626 #define stbir__simdf8_0123to00220022( out, in ) (out) = _mm256_shuffle_ps ( in, in, (0<<0)+(0<<2)+(2<<4)+(2<<6) ) 1627 1628 #define stbir__simdf8_aaa1( out, alp, ones ) (out) = _mm256_blend_ps( alp, ones, (1<<0)+(1<<1)+(1<<2)+(0<<3)+(1<<4)+(1<<5)+(1<<6)+(0<<7)); (out)=_mm256_shuffle_ps( out,out, (3<<0) + (3<<2) + (3<<4) + (0<<6) ) 1629 #define stbir__simdf8_1aaa( out, alp, ones ) (out) = _mm256_blend_ps( alp, ones, (0<<0)+(1<<1)+(1<<2)+(1<<3)+(0<<4)+(1<<5)+(1<<6)+(1<<7)); (out)=_mm256_shuffle_ps( out,out, (1<<0) + (0<<2) + (0<<4) + (0<<6) ) 1630 #define stbir__simdf8_a1a1( out, alp, ones) (out) = _mm256_blend_ps( alp, ones, (1<<0)+(0<<1)+(1<<2)+(0<<3)+(1<<4)+(0<<5)+(1<<6)+(0<<7)); (out)=_mm256_shuffle_ps( out,out, (1<<0) + (0<<2) + (3<<4) + (2<<6) ) 1631 #define stbir__simdf8_1a1a( out, alp, ones) (out) = _mm256_blend_ps( alp, ones, (0<<0)+(1<<1)+(0<<2)+(1<<3)+(0<<4)+(1<<5)+(0<<6)+(1<<7)); (out)=_mm256_shuffle_ps( out,out, (1<<0) + (0<<2) + (3<<4) + (2<<6) ) 1632 1633 #define stbir__simdf8_zero( reg ) (reg) = _mm256_setzero_ps() 1634 1635 #ifdef STBIR_USE_FMA // not on by default to maintain bit identical simd to non-simd 1636 #define stbir__simdf8_madd( out, add, mul1, mul2 ) (out) = _mm256_fmadd_ps( mul1, mul2, add ) 1637 #define stbir__simdf8_madd_mem( out, add, mul, ptr ) (out) = _mm256_fmadd_ps( mul, _mm256_loadu_ps( (float const*)(ptr) ), add ) 1638 #define stbir__simdf8_madd_mem4( out, add, mul, ptr )(out) = _mm256_fmadd_ps( _mm256_setr_m128( mul, _mm_setzero_ps() ), _mm256_setr_m128( _mm_loadu_ps( (float const*)(ptr) ), _mm_setzero_ps() ), add ) 1639 #else 1640 #define stbir__simdf8_madd( out, add, mul1, mul2 ) (out) = _mm256_add_ps( add, _mm256_mul_ps( mul1, mul2 ) ) 1641 #define stbir__simdf8_madd_mem( out, add, mul, ptr ) (out) = _mm256_add_ps( add, _mm256_mul_ps( mul, _mm256_loadu_ps( (float const*)(ptr) ) ) ) 1642 #define stbir__simdf8_madd_mem4( out, add, mul, ptr ) (out) = _mm256_add_ps( add, _mm256_setr_m128( _mm_mul_ps( mul, _mm_loadu_ps( (float const*)(ptr) ) ), _mm_setzero_ps() ) ) 1643 #endif 1644 #define stbir__if_simdf8_cast_to_simdf4( val ) _mm256_castps256_ps128( val ) 1645 1646 #endif 1647 1648 #ifdef STBIR_FLOORF 1649 #undef STBIR_FLOORF 1650 #endif 1651 #define STBIR_FLOORF stbir_simd_floorf 1652 static stbir__inline float stbir_simd_floorf(float x) // martins floorf 1653 { 1654 #if defined(STBIR_AVX) || defined(__SSE4_1__) || defined(STBIR_SSE41) 1655 __m128 t = _mm_set_ss(x); 1656 return _mm_cvtss_f32( _mm_floor_ss(t, t) ); 1657 #else 1658 __m128 f = _mm_set_ss(x); 1659 __m128 t = _mm_cvtepi32_ps(_mm_cvttps_epi32(f)); 1660 __m128 r = _mm_add_ss(t, _mm_and_ps(_mm_cmplt_ss(f, t), _mm_set_ss(-1.0f))); 1661 return _mm_cvtss_f32(r); 1662 #endif 1663 } 1664 1665 #ifdef STBIR_CEILF 1666 #undef STBIR_CEILF 1667 #endif 1668 #define STBIR_CEILF stbir_simd_ceilf 1669 static stbir__inline float stbir_simd_ceilf(float x) // martins ceilf 1670 { 1671 #if defined(STBIR_AVX) || defined(__SSE4_1__) || defined(STBIR_SSE41) 1672 __m128 t = _mm_set_ss(x); 1673 return _mm_cvtss_f32( _mm_ceil_ss(t, t) ); 1674 #else 1675 __m128 f = _mm_set_ss(x); 1676 __m128 t = _mm_cvtepi32_ps(_mm_cvttps_epi32(f)); 1677 __m128 r = _mm_add_ss(t, _mm_and_ps(_mm_cmplt_ss(t, f), _mm_set_ss(1.0f))); 1678 return _mm_cvtss_f32(r); 1679 #endif 1680 } 1681 1682 #elif defined(STBIR_NEON) 1683 1684 #include <arm_neon.h> 1685 1686 #define stbir__simdf float32x4_t 1687 #define stbir__simdi uint32x4_t 1688 1689 #define stbir_simdi_castf( reg ) vreinterpretq_u32_f32(reg) 1690 #define stbir_simdf_casti( reg ) vreinterpretq_f32_u32(reg) 1691 1692 #define stbir__simdf_load( reg, ptr ) (reg) = vld1q_f32( (float const*)(ptr) ) 1693 #define stbir__simdi_load( reg, ptr ) (reg) = vld1q_u32( (uint32_t const*)(ptr) ) 1694 #define stbir__simdf_load1( out, ptr ) (out) = vld1q_dup_f32( (float const*)(ptr) ) // top values can be random (not denormal or nan for perf) 1695 #define stbir__simdi_load1( out, ptr ) (out) = vld1q_dup_u32( (uint32_t const*)(ptr) ) 1696 #define stbir__simdf_load1z( out, ptr ) (out) = vld1q_lane_f32( (float const*)(ptr), vdupq_n_f32(0), 0 ) // top values must be zero 1697 #define stbir__simdf_frep4( fvar ) vdupq_n_f32( fvar ) 1698 #define stbir__simdf_load1frep4( out, fvar ) (out) = vdupq_n_f32( fvar ) 1699 #define stbir__simdf_load2( out, ptr ) (out) = vcombine_f32( vld1_f32( (float const*)(ptr) ), vcreate_f32(0) ) // top values can be random (not denormal or nan for perf) 1700 #define stbir__simdf_load2z( out, ptr ) (out) = vcombine_f32( vld1_f32( (float const*)(ptr) ), vcreate_f32(0) ) // top values must be zero 1701 #define stbir__simdf_load2hmerge( out, reg, ptr ) (out) = vcombine_f32( vget_low_f32(reg), vld1_f32( (float const*)(ptr) ) ) 1702 1703 #define stbir__simdf_zeroP() vdupq_n_f32(0) 1704 #define stbir__simdf_zero( reg ) (reg) = vdupq_n_f32(0) 1705 1706 #define stbir__simdf_store( ptr, reg ) vst1q_f32( (float*)(ptr), reg ) 1707 #define stbir__simdf_store1( ptr, reg ) vst1q_lane_f32( (float*)(ptr), reg, 0) 1708 #define stbir__simdf_store2( ptr, reg ) vst1_f32( (float*)(ptr), vget_low_f32(reg) ) 1709 #define stbir__simdf_store2h( ptr, reg ) vst1_f32( (float*)(ptr), vget_high_f32(reg) ) 1710 1711 #define stbir__simdi_store( ptr, reg ) vst1q_u32( (uint32_t*)(ptr), reg ) 1712 #define stbir__simdi_store1( ptr, reg ) vst1q_lane_u32( (uint32_t*)(ptr), reg, 0 ) 1713 #define stbir__simdi_store2( ptr, reg ) vst1_u32( (uint32_t*)(ptr), vget_low_u32(reg) ) 1714 1715 #define stbir__prefetch( ptr ) 1716 1717 #define stbir__simdi_expand_u8_to_u32(out0,out1,out2,out3,ireg) \ 1718 { \ 1719 uint16x8_t l = vmovl_u8( vget_low_u8 ( vreinterpretq_u8_u32(ireg) ) ); \ 1720 uint16x8_t h = vmovl_u8( vget_high_u8( vreinterpretq_u8_u32(ireg) ) ); \ 1721 out0 = vmovl_u16( vget_low_u16 ( l ) ); \ 1722 out1 = vmovl_u16( vget_high_u16( l ) ); \ 1723 out2 = vmovl_u16( vget_low_u16 ( h ) ); \ 1724 out3 = vmovl_u16( vget_high_u16( h ) ); \ 1725 } 1726 1727 #define stbir__simdi_expand_u8_to_1u32(out,ireg) \ 1728 { \ 1729 uint16x8_t tmp = vmovl_u8( vget_low_u8( vreinterpretq_u8_u32(ireg) ) ); \ 1730 out = vmovl_u16( vget_low_u16( tmp ) ); \ 1731 } 1732 1733 #define stbir__simdi_expand_u16_to_u32(out0,out1,ireg) \ 1734 { \ 1735 uint16x8_t tmp = vreinterpretq_u16_u32(ireg); \ 1736 out0 = vmovl_u16( vget_low_u16 ( tmp ) ); \ 1737 out1 = vmovl_u16( vget_high_u16( tmp ) ); \ 1738 } 1739 1740 #define stbir__simdf_convert_float_to_i32( i, f ) (i) = vreinterpretq_u32_s32( vcvtq_s32_f32(f) ) 1741 #define stbir__simdf_convert_float_to_int( f ) vgetq_lane_s32(vcvtq_s32_f32(f), 0) 1742 #define stbir__simdi_to_int( i ) (int)vgetq_lane_u32(i, 0) 1743 #define stbir__simdf_convert_float_to_uint8( f ) ((unsigned char)vgetq_lane_s32(vcvtq_s32_f32(vmaxq_f32(vminq_f32(f,STBIR__CONSTF(STBIR_max_uint8_as_float)),vdupq_n_f32(0))), 0)) 1744 #define stbir__simdf_convert_float_to_short( f ) ((unsigned short)vgetq_lane_s32(vcvtq_s32_f32(vmaxq_f32(vminq_f32(f,STBIR__CONSTF(STBIR_max_uint16_as_float)),vdupq_n_f32(0))), 0)) 1745 #define stbir__simdi_convert_i32_to_float(out, ireg) (out) = vcvtq_f32_s32( vreinterpretq_s32_u32(ireg) ) 1746 #define stbir__simdf_add( out, reg0, reg1 ) (out) = vaddq_f32( reg0, reg1 ) 1747 #define stbir__simdf_mult( out, reg0, reg1 ) (out) = vmulq_f32( reg0, reg1 ) 1748 #define stbir__simdf_mult_mem( out, reg, ptr ) (out) = vmulq_f32( reg, vld1q_f32( (float const*)(ptr) ) ) 1749 #define stbir__simdf_mult1_mem( out, reg, ptr ) (out) = vmulq_f32( reg, vld1q_dup_f32( (float const*)(ptr) ) ) 1750 #define stbir__simdf_add_mem( out, reg, ptr ) (out) = vaddq_f32( reg, vld1q_f32( (float const*)(ptr) ) ) 1751 #define stbir__simdf_add1_mem( out, reg, ptr ) (out) = vaddq_f32( reg, vld1q_dup_f32( (float const*)(ptr) ) ) 1752 1753 #ifdef STBIR_USE_FMA // not on by default to maintain bit identical simd to non-simd (and also x64 no madd to arm madd) 1754 #define stbir__simdf_madd( out, add, mul1, mul2 ) (out) = vfmaq_f32( add, mul1, mul2 ) 1755 #define stbir__simdf_madd1( out, add, mul1, mul2 ) (out) = vfmaq_f32( add, mul1, mul2 ) 1756 #define stbir__simdf_madd_mem( out, add, mul, ptr ) (out) = vfmaq_f32( add, mul, vld1q_f32( (float const*)(ptr) ) ) 1757 #define stbir__simdf_madd1_mem( out, add, mul, ptr ) (out) = vfmaq_f32( add, mul, vld1q_dup_f32( (float const*)(ptr) ) ) 1758 #else 1759 #define stbir__simdf_madd( out, add, mul1, mul2 ) (out) = vaddq_f32( add, vmulq_f32( mul1, mul2 ) ) 1760 #define stbir__simdf_madd1( out, add, mul1, mul2 ) (out) = vaddq_f32( add, vmulq_f32( mul1, mul2 ) ) 1761 #define stbir__simdf_madd_mem( out, add, mul, ptr ) (out) = vaddq_f32( add, vmulq_f32( mul, vld1q_f32( (float const*)(ptr) ) ) ) 1762 #define stbir__simdf_madd1_mem( out, add, mul, ptr ) (out) = vaddq_f32( add, vmulq_f32( mul, vld1q_dup_f32( (float const*)(ptr) ) ) ) 1763 #endif 1764 1765 #define stbir__simdf_add1( out, reg0, reg1 ) (out) = vaddq_f32( reg0, reg1 ) 1766 #define stbir__simdf_mult1( out, reg0, reg1 ) (out) = vmulq_f32( reg0, reg1 ) 1767 1768 #define stbir__simdf_and( out, reg0, reg1 ) (out) = vreinterpretq_f32_u32( vandq_u32( vreinterpretq_u32_f32(reg0), vreinterpretq_u32_f32(reg1) ) ) 1769 #define stbir__simdf_or( out, reg0, reg1 ) (out) = vreinterpretq_f32_u32( vorrq_u32( vreinterpretq_u32_f32(reg0), vreinterpretq_u32_f32(reg1) ) ) 1770 1771 #define stbir__simdf_min( out, reg0, reg1 ) (out) = vminq_f32( reg0, reg1 ) 1772 #define stbir__simdf_max( out, reg0, reg1 ) (out) = vmaxq_f32( reg0, reg1 ) 1773 #define stbir__simdf_min1( out, reg0, reg1 ) (out) = vminq_f32( reg0, reg1 ) 1774 #define stbir__simdf_max1( out, reg0, reg1 ) (out) = vmaxq_f32( reg0, reg1 ) 1775 1776 #define stbir__simdf_0123ABCDto3ABx( out, reg0, reg1 ) (out) = vextq_f32( reg0, reg1, 3 ) 1777 #define stbir__simdf_0123ABCDto23Ax( out, reg0, reg1 ) (out) = vextq_f32( reg0, reg1, 2 ) 1778 1779 #define stbir__simdf_a1a1( out, alp, ones ) (out) = vzipq_f32(vuzpq_f32(alp, alp).val[1], ones).val[0] 1780 #define stbir__simdf_1a1a( out, alp, ones ) (out) = vzipq_f32(ones, vuzpq_f32(alp, alp).val[0]).val[0] 1781 1782 #if defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ ) 1783 1784 #define stbir__simdf_aaa1( out, alp, ones ) (out) = vcopyq_laneq_f32(vdupq_n_f32(vgetq_lane_f32(alp, 3)), 3, ones, 3) 1785 #define stbir__simdf_1aaa( out, alp, ones ) (out) = vcopyq_laneq_f32(vdupq_n_f32(vgetq_lane_f32(alp, 0)), 0, ones, 0) 1786 1787 #if defined( _MSC_VER ) && !defined(__clang__) 1788 #define stbir_make16(a,b,c,d) vcombine_u8( \ 1789 vcreate_u8( (4*a+0) | ((4*a+1)<<8) | ((4*a+2)<<16) | ((4*a+3)<<24) | \ 1790 ((stbir_uint64)(4*b+0)<<32) | ((stbir_uint64)(4*b+1)<<40) | ((stbir_uint64)(4*b+2)<<48) | ((stbir_uint64)(4*b+3)<<56)), \ 1791 vcreate_u8( (4*c+0) | ((4*c+1)<<8) | ((4*c+2)<<16) | ((4*c+3)<<24) | \ 1792 ((stbir_uint64)(4*d+0)<<32) | ((stbir_uint64)(4*d+1)<<40) | ((stbir_uint64)(4*d+2)<<48) | ((stbir_uint64)(4*d+3)<<56) ) ) 1793 1794 static stbir__inline uint8x16x2_t stbir_make16x2(float32x4_t rega,float32x4_t regb) 1795 { 1796 uint8x16x2_t r = { vreinterpretq_u8_f32(rega), vreinterpretq_u8_f32(regb) }; 1797 return r; 1798 } 1799 #else 1800 #define stbir_make16(a,b,c,d) (uint8x16_t){4*a+0,4*a+1,4*a+2,4*a+3,4*b+0,4*b+1,4*b+2,4*b+3,4*c+0,4*c+1,4*c+2,4*c+3,4*d+0,4*d+1,4*d+2,4*d+3} 1801 #define stbir_make16x2(a,b) (uint8x16x2_t){{vreinterpretq_u8_f32(a),vreinterpretq_u8_f32(b)}} 1802 #endif 1803 1804 #define stbir__simdf_swiz( reg, one, two, three, four ) vreinterpretq_f32_u8( vqtbl1q_u8( vreinterpretq_u8_f32(reg), stbir_make16(one, two, three, four) ) ) 1805 #define stbir__simdf_swiz2( rega, regb, one, two, three, four ) vreinterpretq_f32_u8( vqtbl2q_u8( stbir_make16x2(rega,regb), stbir_make16(one, two, three, four) ) ) 1806 1807 #define stbir__simdi_16madd( out, reg0, reg1 ) \ 1808 { \ 1809 int16x8_t r0 = vreinterpretq_s16_u32(reg0); \ 1810 int16x8_t r1 = vreinterpretq_s16_u32(reg1); \ 1811 int32x4_t tmp0 = vmull_s16( vget_low_s16(r0), vget_low_s16(r1) ); \ 1812 int32x4_t tmp1 = vmull_s16( vget_high_s16(r0), vget_high_s16(r1) ); \ 1813 (out) = vreinterpretq_u32_s32( vpaddq_s32(tmp0, tmp1) ); \ 1814 } 1815 1816 #else 1817 1818 #define stbir__simdf_aaa1( out, alp, ones ) (out) = vsetq_lane_f32(1.0f, vdupq_n_f32(vgetq_lane_f32(alp, 3)), 3) 1819 #define stbir__simdf_1aaa( out, alp, ones ) (out) = vsetq_lane_f32(1.0f, vdupq_n_f32(vgetq_lane_f32(alp, 0)), 0) 1820 1821 #if defined( _MSC_VER ) && !defined(__clang__) 1822 static stbir__inline uint8x8x2_t stbir_make8x2(float32x4_t reg) 1823 { 1824 uint8x8x2_t r = { { vget_low_u8(vreinterpretq_u8_f32(reg)), vget_high_u8(vreinterpretq_u8_f32(reg)) } }; 1825 return r; 1826 } 1827 #define stbir_make8(a,b) vcreate_u8( \ 1828 (4*a+0) | ((4*a+1)<<8) | ((4*a+2)<<16) | ((4*a+3)<<24) | \ 1829 ((stbir_uint64)(4*b+0)<<32) | ((stbir_uint64)(4*b+1)<<40) | ((stbir_uint64)(4*b+2)<<48) | ((stbir_uint64)(4*b+3)<<56) ) 1830 #else 1831 #define stbir_make8x2(reg) (uint8x8x2_t){ { vget_low_u8(vreinterpretq_u8_f32(reg)), vget_high_u8(vreinterpretq_u8_f32(reg)) } } 1832 #define stbir_make8(a,b) (uint8x8_t){4*a+0,4*a+1,4*a+2,4*a+3,4*b+0,4*b+1,4*b+2,4*b+3} 1833 #endif 1834 1835 #define stbir__simdf_swiz( reg, one, two, three, four ) vreinterpretq_f32_u8( vcombine_u8( \ 1836 vtbl2_u8( stbir_make8x2( reg ), stbir_make8( one, two ) ), \ 1837 vtbl2_u8( stbir_make8x2( reg ), stbir_make8( three, four ) ) ) ) 1838 1839 #define stbir__simdi_16madd( out, reg0, reg1 ) \ 1840 { \ 1841 int16x8_t r0 = vreinterpretq_s16_u32(reg0); \ 1842 int16x8_t r1 = vreinterpretq_s16_u32(reg1); \ 1843 int32x4_t tmp0 = vmull_s16( vget_low_s16(r0), vget_low_s16(r1) ); \ 1844 int32x4_t tmp1 = vmull_s16( vget_high_s16(r0), vget_high_s16(r1) ); \ 1845 int32x2_t out0 = vpadd_s32( vget_low_s32(tmp0), vget_high_s32(tmp0) ); \ 1846 int32x2_t out1 = vpadd_s32( vget_low_s32(tmp1), vget_high_s32(tmp1) ); \ 1847 (out) = vreinterpretq_u32_s32( vcombine_s32(out0, out1) ); \ 1848 } 1849 1850 #endif 1851 1852 #define stbir__simdi_and( out, reg0, reg1 ) (out) = vandq_u32( reg0, reg1 ) 1853 #define stbir__simdi_or( out, reg0, reg1 ) (out) = vorrq_u32( reg0, reg1 ) 1854 1855 #define stbir__simdf_pack_to_8bytes(out,aa,bb) \ 1856 { \ 1857 float32x4_t af = vmaxq_f32( vminq_f32(aa,STBIR__CONSTF(STBIR_max_uint8_as_float) ), vdupq_n_f32(0) ); \ 1858 float32x4_t bf = vmaxq_f32( vminq_f32(bb,STBIR__CONSTF(STBIR_max_uint8_as_float) ), vdupq_n_f32(0) ); \ 1859 int16x4_t ai = vqmovn_s32( vcvtq_s32_f32( af ) ); \ 1860 int16x4_t bi = vqmovn_s32( vcvtq_s32_f32( bf ) ); \ 1861 uint8x8_t out8 = vqmovun_s16( vcombine_s16(ai, bi) ); \ 1862 out = vreinterpretq_u32_u8( vcombine_u8(out8, out8) ); \ 1863 } 1864 1865 #define stbir__simdf_pack_to_8words(out,aa,bb) \ 1866 { \ 1867 float32x4_t af = vmaxq_f32( vminq_f32(aa,STBIR__CONSTF(STBIR_max_uint16_as_float) ), vdupq_n_f32(0) ); \ 1868 float32x4_t bf = vmaxq_f32( vminq_f32(bb,STBIR__CONSTF(STBIR_max_uint16_as_float) ), vdupq_n_f32(0) ); \ 1869 int32x4_t ai = vcvtq_s32_f32( af ); \ 1870 int32x4_t bi = vcvtq_s32_f32( bf ); \ 1871 out = vreinterpretq_u32_u16( vcombine_u16(vqmovun_s32(ai), vqmovun_s32(bi)) ); \ 1872 } 1873 1874 #define stbir__interleave_pack_and_store_16_u8( ptr, r0, r1, r2, r3 ) \ 1875 { \ 1876 int16x4x2_t tmp0 = vzip_s16( vqmovn_s32(vreinterpretq_s32_u32(r0)), vqmovn_s32(vreinterpretq_s32_u32(r2)) ); \ 1877 int16x4x2_t tmp1 = vzip_s16( vqmovn_s32(vreinterpretq_s32_u32(r1)), vqmovn_s32(vreinterpretq_s32_u32(r3)) ); \ 1878 uint8x8x2_t out = \ 1879 { { \ 1880 vqmovun_s16( vcombine_s16(tmp0.val[0], tmp0.val[1]) ), \ 1881 vqmovun_s16( vcombine_s16(tmp1.val[0], tmp1.val[1]) ), \ 1882 } }; \ 1883 vst2_u8(ptr, out); \ 1884 } 1885 1886 #define stbir__simdf_load4_transposed( o0, o1, o2, o3, ptr ) \ 1887 { \ 1888 float32x4x4_t tmp = vld4q_f32(ptr); \ 1889 o0 = tmp.val[0]; \ 1890 o1 = tmp.val[1]; \ 1891 o2 = tmp.val[2]; \ 1892 o3 = tmp.val[3]; \ 1893 } 1894 1895 #define stbir__simdi_32shr( out, reg, imm ) out = vshrq_n_u32( reg, imm ) 1896 1897 #if defined( _MSC_VER ) && !defined(__clang__) 1898 #define STBIR__SIMDF_CONST(var, x) __declspec(align(8)) float var[] = { x, x, x, x } 1899 #define STBIR__SIMDI_CONST(var, x) __declspec(align(8)) uint32_t var[] = { x, x, x, x } 1900 #define STBIR__CONSTF(var) (*(const float32x4_t*)var) 1901 #define STBIR__CONSTI(var) (*(const uint32x4_t*)var) 1902 #else 1903 #define STBIR__SIMDF_CONST(var, x) stbir__simdf var = { x, x, x, x } 1904 #define STBIR__SIMDI_CONST(var, x) stbir__simdi var = { x, x, x, x } 1905 #define STBIR__CONSTF(var) (var) 1906 #define STBIR__CONSTI(var) (var) 1907 #endif 1908 1909 #ifdef STBIR_FLOORF 1910 #undef STBIR_FLOORF 1911 #endif 1912 #define STBIR_FLOORF stbir_simd_floorf 1913 static stbir__inline float stbir_simd_floorf(float x) 1914 { 1915 #if defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ ) 1916 return vget_lane_f32( vrndm_f32( vdup_n_f32(x) ), 0); 1917 #else 1918 float32x2_t f = vdup_n_f32(x); 1919 float32x2_t t = vcvt_f32_s32(vcvt_s32_f32(f)); 1920 uint32x2_t a = vclt_f32(f, t); 1921 uint32x2_t b = vreinterpret_u32_f32(vdup_n_f32(-1.0f)); 1922 float32x2_t r = vadd_f32(t, vreinterpret_f32_u32(vand_u32(a, b))); 1923 return vget_lane_f32(r, 0); 1924 #endif 1925 } 1926 1927 #ifdef STBIR_CEILF 1928 #undef STBIR_CEILF 1929 #endif 1930 #define STBIR_CEILF stbir_simd_ceilf 1931 static stbir__inline float stbir_simd_ceilf(float x) 1932 { 1933 #if defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ ) 1934 return vget_lane_f32( vrndp_f32( vdup_n_f32(x) ), 0); 1935 #else 1936 float32x2_t f = vdup_n_f32(x); 1937 float32x2_t t = vcvt_f32_s32(vcvt_s32_f32(f)); 1938 uint32x2_t a = vclt_f32(t, f); 1939 uint32x2_t b = vreinterpret_u32_f32(vdup_n_f32(1.0f)); 1940 float32x2_t r = vadd_f32(t, vreinterpret_f32_u32(vand_u32(a, b))); 1941 return vget_lane_f32(r, 0); 1942 #endif 1943 } 1944 1945 #define STBIR_SIMD 1946 1947 #elif defined(STBIR_WASM) 1948 1949 #include <wasm_simd128.h> 1950 1951 #define stbir__simdf v128_t 1952 #define stbir__simdi v128_t 1953 1954 #define stbir_simdi_castf( reg ) (reg) 1955 #define stbir_simdf_casti( reg ) (reg) 1956 1957 #define stbir__simdf_load( reg, ptr ) (reg) = wasm_v128_load( (void const*)(ptr) ) 1958 #define stbir__simdi_load( reg, ptr ) (reg) = wasm_v128_load( (void const*)(ptr) ) 1959 #define stbir__simdf_load1( out, ptr ) (out) = wasm_v128_load32_splat( (void const*)(ptr) ) // top values can be random (not denormal or nan for perf) 1960 #define stbir__simdi_load1( out, ptr ) (out) = wasm_v128_load32_splat( (void const*)(ptr) ) 1961 #define stbir__simdf_load1z( out, ptr ) (out) = wasm_v128_load32_zero( (void const*)(ptr) ) // top values must be zero 1962 #define stbir__simdf_frep4( fvar ) wasm_f32x4_splat( fvar ) 1963 #define stbir__simdf_load1frep4( out, fvar ) (out) = wasm_f32x4_splat( fvar ) 1964 #define stbir__simdf_load2( out, ptr ) (out) = wasm_v128_load64_splat( (void const*)(ptr) ) // top values can be random (not denormal or nan for perf) 1965 #define stbir__simdf_load2z( out, ptr ) (out) = wasm_v128_load64_zero( (void const*)(ptr) ) // top values must be zero 1966 #define stbir__simdf_load2hmerge( out, reg, ptr ) (out) = wasm_v128_load64_lane( (void const*)(ptr), reg, 1 ) 1967 1968 #define stbir__simdf_zeroP() wasm_f32x4_const_splat(0) 1969 #define stbir__simdf_zero( reg ) (reg) = wasm_f32x4_const_splat(0) 1970 1971 #define stbir__simdf_store( ptr, reg ) wasm_v128_store( (void*)(ptr), reg ) 1972 #define stbir__simdf_store1( ptr, reg ) wasm_v128_store32_lane( (void*)(ptr), reg, 0 ) 1973 #define stbir__simdf_store2( ptr, reg ) wasm_v128_store64_lane( (void*)(ptr), reg, 0 ) 1974 #define stbir__simdf_store2h( ptr, reg ) wasm_v128_store64_lane( (void*)(ptr), reg, 1 ) 1975 1976 #define stbir__simdi_store( ptr, reg ) wasm_v128_store( (void*)(ptr), reg ) 1977 #define stbir__simdi_store1( ptr, reg ) wasm_v128_store32_lane( (void*)(ptr), reg, 0 ) 1978 #define stbir__simdi_store2( ptr, reg ) wasm_v128_store64_lane( (void*)(ptr), reg, 0 ) 1979 1980 #define stbir__prefetch( ptr ) 1981 1982 #define stbir__simdi_expand_u8_to_u32(out0,out1,out2,out3,ireg) \ 1983 { \ 1984 v128_t l = wasm_u16x8_extend_low_u8x16 ( ireg ); \ 1985 v128_t h = wasm_u16x8_extend_high_u8x16( ireg ); \ 1986 out0 = wasm_u32x4_extend_low_u16x8 ( l ); \ 1987 out1 = wasm_u32x4_extend_high_u16x8( l ); \ 1988 out2 = wasm_u32x4_extend_low_u16x8 ( h ); \ 1989 out3 = wasm_u32x4_extend_high_u16x8( h ); \ 1990 } 1991 1992 #define stbir__simdi_expand_u8_to_1u32(out,ireg) \ 1993 { \ 1994 v128_t tmp = wasm_u16x8_extend_low_u8x16(ireg); \ 1995 out = wasm_u32x4_extend_low_u16x8(tmp); \ 1996 } 1997 1998 #define stbir__simdi_expand_u16_to_u32(out0,out1,ireg) \ 1999 { \ 2000 out0 = wasm_u32x4_extend_low_u16x8 ( ireg ); \ 2001 out1 = wasm_u32x4_extend_high_u16x8( ireg ); \ 2002 } 2003 2004 #define stbir__simdf_convert_float_to_i32( i, f ) (i) = wasm_i32x4_trunc_sat_f32x4(f) 2005 #define stbir__simdf_convert_float_to_int( f ) wasm_i32x4_extract_lane(wasm_i32x4_trunc_sat_f32x4(f), 0) 2006 #define stbir__simdi_to_int( i ) wasm_i32x4_extract_lane(i, 0) 2007 #define stbir__simdf_convert_float_to_uint8( f ) ((unsigned char)wasm_i32x4_extract_lane(wasm_i32x4_trunc_sat_f32x4(wasm_f32x4_max(wasm_f32x4_min(f,STBIR_max_uint8_as_float),wasm_f32x4_const_splat(0))), 0)) 2008 #define stbir__simdf_convert_float_to_short( f ) ((unsigned short)wasm_i32x4_extract_lane(wasm_i32x4_trunc_sat_f32x4(wasm_f32x4_max(wasm_f32x4_min(f,STBIR_max_uint16_as_float),wasm_f32x4_const_splat(0))), 0)) 2009 #define stbir__simdi_convert_i32_to_float(out, ireg) (out) = wasm_f32x4_convert_i32x4(ireg) 2010 #define stbir__simdf_add( out, reg0, reg1 ) (out) = wasm_f32x4_add( reg0, reg1 ) 2011 #define stbir__simdf_mult( out, reg0, reg1 ) (out) = wasm_f32x4_mul( reg0, reg1 ) 2012 #define stbir__simdf_mult_mem( out, reg, ptr ) (out) = wasm_f32x4_mul( reg, wasm_v128_load( (void const*)(ptr) ) ) 2013 #define stbir__simdf_mult1_mem( out, reg, ptr ) (out) = wasm_f32x4_mul( reg, wasm_v128_load32_splat( (void const*)(ptr) ) ) 2014 #define stbir__simdf_add_mem( out, reg, ptr ) (out) = wasm_f32x4_add( reg, wasm_v128_load( (void const*)(ptr) ) ) 2015 #define stbir__simdf_add1_mem( out, reg, ptr ) (out) = wasm_f32x4_add( reg, wasm_v128_load32_splat( (void const*)(ptr) ) ) 2016 2017 #define stbir__simdf_madd( out, add, mul1, mul2 ) (out) = wasm_f32x4_add( add, wasm_f32x4_mul( mul1, mul2 ) ) 2018 #define stbir__simdf_madd1( out, add, mul1, mul2 ) (out) = wasm_f32x4_add( add, wasm_f32x4_mul( mul1, mul2 ) ) 2019 #define stbir__simdf_madd_mem( out, add, mul, ptr ) (out) = wasm_f32x4_add( add, wasm_f32x4_mul( mul, wasm_v128_load( (void const*)(ptr) ) ) ) 2020 #define stbir__simdf_madd1_mem( out, add, mul, ptr ) (out) = wasm_f32x4_add( add, wasm_f32x4_mul( mul, wasm_v128_load32_splat( (void const*)(ptr) ) ) ) 2021 2022 #define stbir__simdf_add1( out, reg0, reg1 ) (out) = wasm_f32x4_add( reg0, reg1 ) 2023 #define stbir__simdf_mult1( out, reg0, reg1 ) (out) = wasm_f32x4_mul( reg0, reg1 ) 2024 2025 #define stbir__simdf_and( out, reg0, reg1 ) (out) = wasm_v128_and( reg0, reg1 ) 2026 #define stbir__simdf_or( out, reg0, reg1 ) (out) = wasm_v128_or( reg0, reg1 ) 2027 2028 #define stbir__simdf_min( out, reg0, reg1 ) (out) = wasm_f32x4_min( reg0, reg1 ) 2029 #define stbir__simdf_max( out, reg0, reg1 ) (out) = wasm_f32x4_max( reg0, reg1 ) 2030 #define stbir__simdf_min1( out, reg0, reg1 ) (out) = wasm_f32x4_min( reg0, reg1 ) 2031 #define stbir__simdf_max1( out, reg0, reg1 ) (out) = wasm_f32x4_max( reg0, reg1 ) 2032 2033 #define stbir__simdf_0123ABCDto3ABx( out, reg0, reg1 ) (out) = wasm_i32x4_shuffle( reg0, reg1, 3, 4, 5, -1 ) 2034 #define stbir__simdf_0123ABCDto23Ax( out, reg0, reg1 ) (out) = wasm_i32x4_shuffle( reg0, reg1, 2, 3, 4, -1 ) 2035 2036 #define stbir__simdf_aaa1(out,alp,ones) (out) = wasm_i32x4_shuffle(alp, ones, 3, 3, 3, 4) 2037 #define stbir__simdf_1aaa(out,alp,ones) (out) = wasm_i32x4_shuffle(alp, ones, 4, 0, 0, 0) 2038 #define stbir__simdf_a1a1(out,alp,ones) (out) = wasm_i32x4_shuffle(alp, ones, 1, 4, 3, 4) 2039 #define stbir__simdf_1a1a(out,alp,ones) (out) = wasm_i32x4_shuffle(alp, ones, 4, 0, 4, 2) 2040 2041 #define stbir__simdf_swiz( reg, one, two, three, four ) wasm_i32x4_shuffle(reg, reg, one, two, three, four) 2042 2043 #define stbir__simdi_and( out, reg0, reg1 ) (out) = wasm_v128_and( reg0, reg1 ) 2044 #define stbir__simdi_or( out, reg0, reg1 ) (out) = wasm_v128_or( reg0, reg1 ) 2045 #define stbir__simdi_16madd( out, reg0, reg1 ) (out) = wasm_i32x4_dot_i16x8( reg0, reg1 ) 2046 2047 #define stbir__simdf_pack_to_8bytes(out,aa,bb) \ 2048 { \ 2049 v128_t af = wasm_f32x4_max( wasm_f32x4_min(aa, STBIR_max_uint8_as_float), wasm_f32x4_const_splat(0) ); \ 2050 v128_t bf = wasm_f32x4_max( wasm_f32x4_min(bb, STBIR_max_uint8_as_float), wasm_f32x4_const_splat(0) ); \ 2051 v128_t ai = wasm_i32x4_trunc_sat_f32x4( af ); \ 2052 v128_t bi = wasm_i32x4_trunc_sat_f32x4( bf ); \ 2053 v128_t out16 = wasm_i16x8_narrow_i32x4( ai, bi ); \ 2054 out = wasm_u8x16_narrow_i16x8( out16, out16 ); \ 2055 } 2056 2057 #define stbir__simdf_pack_to_8words(out,aa,bb) \ 2058 { \ 2059 v128_t af = wasm_f32x4_max( wasm_f32x4_min(aa, STBIR_max_uint16_as_float), wasm_f32x4_const_splat(0)); \ 2060 v128_t bf = wasm_f32x4_max( wasm_f32x4_min(bb, STBIR_max_uint16_as_float), wasm_f32x4_const_splat(0)); \ 2061 v128_t ai = wasm_i32x4_trunc_sat_f32x4( af ); \ 2062 v128_t bi = wasm_i32x4_trunc_sat_f32x4( bf ); \ 2063 out = wasm_u16x8_narrow_i32x4( ai, bi ); \ 2064 } 2065 2066 #define stbir__interleave_pack_and_store_16_u8( ptr, r0, r1, r2, r3 ) \ 2067 { \ 2068 v128_t tmp0 = wasm_i16x8_narrow_i32x4(r0, r1); \ 2069 v128_t tmp1 = wasm_i16x8_narrow_i32x4(r2, r3); \ 2070 v128_t tmp = wasm_u8x16_narrow_i16x8(tmp0, tmp1); \ 2071 tmp = wasm_i8x16_shuffle(tmp, tmp, 0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15); \ 2072 wasm_v128_store( (void*)(ptr), tmp); \ 2073 } 2074 2075 #define stbir__simdf_load4_transposed( o0, o1, o2, o3, ptr ) \ 2076 { \ 2077 v128_t t0 = wasm_v128_load( ptr ); \ 2078 v128_t t1 = wasm_v128_load( ptr+4 ); \ 2079 v128_t t2 = wasm_v128_load( ptr+8 ); \ 2080 v128_t t3 = wasm_v128_load( ptr+12 ); \ 2081 v128_t s0 = wasm_i32x4_shuffle(t0, t1, 0, 4, 2, 6); \ 2082 v128_t s1 = wasm_i32x4_shuffle(t0, t1, 1, 5, 3, 7); \ 2083 v128_t s2 = wasm_i32x4_shuffle(t2, t3, 0, 4, 2, 6); \ 2084 v128_t s3 = wasm_i32x4_shuffle(t2, t3, 1, 5, 3, 7); \ 2085 o0 = wasm_i32x4_shuffle(s0, s2, 0, 1, 4, 5); \ 2086 o1 = wasm_i32x4_shuffle(s1, s3, 0, 1, 4, 5); \ 2087 o2 = wasm_i32x4_shuffle(s0, s2, 2, 3, 6, 7); \ 2088 o3 = wasm_i32x4_shuffle(s1, s3, 2, 3, 6, 7); \ 2089 } 2090 2091 #define stbir__simdi_32shr( out, reg, imm ) out = wasm_u32x4_shr( reg, imm ) 2092 2093 typedef float stbir__f32x4 __attribute__((__vector_size__(16), __aligned__(16))); 2094 #define STBIR__SIMDF_CONST(var, x) stbir__simdf var = (v128_t)(stbir__f32x4){ x, x, x, x } 2095 #define STBIR__SIMDI_CONST(var, x) stbir__simdi var = { x, x, x, x } 2096 #define STBIR__CONSTF(var) (var) 2097 #define STBIR__CONSTI(var) (var) 2098 2099 #ifdef STBIR_FLOORF 2100 #undef STBIR_FLOORF 2101 #endif 2102 #define STBIR_FLOORF stbir_simd_floorf 2103 static stbir__inline float stbir_simd_floorf(float x) 2104 { 2105 return wasm_f32x4_extract_lane( wasm_f32x4_floor( wasm_f32x4_splat(x) ), 0); 2106 } 2107 2108 #ifdef STBIR_CEILF 2109 #undef STBIR_CEILF 2110 #endif 2111 #define STBIR_CEILF stbir_simd_ceilf 2112 static stbir__inline float stbir_simd_ceilf(float x) 2113 { 2114 return wasm_f32x4_extract_lane( wasm_f32x4_ceil( wasm_f32x4_splat(x) ), 0); 2115 } 2116 2117 #define STBIR_SIMD 2118 2119 #endif // SSE2/NEON/WASM 2120 2121 #endif // NO SIMD 2122 2123 #ifdef STBIR_SIMD8 2124 #define stbir__simdfX stbir__simdf8 2125 #define stbir__simdiX stbir__simdi8 2126 #define stbir__simdfX_load stbir__simdf8_load 2127 #define stbir__simdiX_load stbir__simdi8_load 2128 #define stbir__simdfX_mult stbir__simdf8_mult 2129 #define stbir__simdfX_add_mem stbir__simdf8_add_mem 2130 #define stbir__simdfX_madd_mem stbir__simdf8_madd_mem 2131 #define stbir__simdfX_store stbir__simdf8_store 2132 #define stbir__simdiX_store stbir__simdi8_store 2133 #define stbir__simdf_frepX stbir__simdf8_frep8 2134 #define stbir__simdfX_madd stbir__simdf8_madd 2135 #define stbir__simdfX_min stbir__simdf8_min 2136 #define stbir__simdfX_max stbir__simdf8_max 2137 #define stbir__simdfX_aaa1 stbir__simdf8_aaa1 2138 #define stbir__simdfX_1aaa stbir__simdf8_1aaa 2139 #define stbir__simdfX_a1a1 stbir__simdf8_a1a1 2140 #define stbir__simdfX_1a1a stbir__simdf8_1a1a 2141 #define stbir__simdfX_convert_float_to_i32 stbir__simdf8_convert_float_to_i32 2142 #define stbir__simdfX_pack_to_words stbir__simdf8_pack_to_16words 2143 #define stbir__simdfX_zero stbir__simdf8_zero 2144 #define STBIR_onesX STBIR_ones8 2145 #define STBIR_max_uint8_as_floatX STBIR_max_uint8_as_float8 2146 #define STBIR_max_uint16_as_floatX STBIR_max_uint16_as_float8 2147 #define STBIR_simd_point5X STBIR_simd_point58 2148 #define stbir__simdfX_float_count 8 2149 #define stbir__simdfX_0123to1230 stbir__simdf8_0123to12301230 2150 #define stbir__simdfX_0123to2103 stbir__simdf8_0123to21032103 2151 static const stbir__simdf8 STBIR_max_uint16_as_float_inverted8 = { stbir__max_uint16_as_float_inverted,stbir__max_uint16_as_float_inverted,stbir__max_uint16_as_float_inverted,stbir__max_uint16_as_float_inverted,stbir__max_uint16_as_float_inverted,stbir__max_uint16_as_float_inverted,stbir__max_uint16_as_float_inverted,stbir__max_uint16_as_float_inverted }; 2152 static const stbir__simdf8 STBIR_max_uint8_as_float_inverted8 = { stbir__max_uint8_as_float_inverted,stbir__max_uint8_as_float_inverted,stbir__max_uint8_as_float_inverted,stbir__max_uint8_as_float_inverted,stbir__max_uint8_as_float_inverted,stbir__max_uint8_as_float_inverted,stbir__max_uint8_as_float_inverted,stbir__max_uint8_as_float_inverted }; 2153 static const stbir__simdf8 STBIR_ones8 = { 1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0 }; 2154 static const stbir__simdf8 STBIR_simd_point58 = { 0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5 }; 2155 static const stbir__simdf8 STBIR_max_uint8_as_float8 = { stbir__max_uint8_as_float,stbir__max_uint8_as_float,stbir__max_uint8_as_float,stbir__max_uint8_as_float, stbir__max_uint8_as_float,stbir__max_uint8_as_float,stbir__max_uint8_as_float,stbir__max_uint8_as_float }; 2156 static const stbir__simdf8 STBIR_max_uint16_as_float8 = { stbir__max_uint16_as_float,stbir__max_uint16_as_float,stbir__max_uint16_as_float,stbir__max_uint16_as_float, stbir__max_uint16_as_float,stbir__max_uint16_as_float,stbir__max_uint16_as_float,stbir__max_uint16_as_float }; 2157 #else 2158 #define stbir__simdfX stbir__simdf 2159 #define stbir__simdiX stbir__simdi 2160 #define stbir__simdfX_load stbir__simdf_load 2161 #define stbir__simdiX_load stbir__simdi_load 2162 #define stbir__simdfX_mult stbir__simdf_mult 2163 #define stbir__simdfX_add_mem stbir__simdf_add_mem 2164 #define stbir__simdfX_madd_mem stbir__simdf_madd_mem 2165 #define stbir__simdfX_store stbir__simdf_store 2166 #define stbir__simdiX_store stbir__simdi_store 2167 #define stbir__simdf_frepX stbir__simdf_frep4 2168 #define stbir__simdfX_madd stbir__simdf_madd 2169 #define stbir__simdfX_min stbir__simdf_min 2170 #define stbir__simdfX_max stbir__simdf_max 2171 #define stbir__simdfX_aaa1 stbir__simdf_aaa1 2172 #define stbir__simdfX_1aaa stbir__simdf_1aaa 2173 #define stbir__simdfX_a1a1 stbir__simdf_a1a1 2174 #define stbir__simdfX_1a1a stbir__simdf_1a1a 2175 #define stbir__simdfX_convert_float_to_i32 stbir__simdf_convert_float_to_i32 2176 #define stbir__simdfX_pack_to_words stbir__simdf_pack_to_8words 2177 #define stbir__simdfX_zero stbir__simdf_zero 2178 #define STBIR_onesX STBIR__CONSTF(STBIR_ones) 2179 #define STBIR_simd_point5X STBIR__CONSTF(STBIR_simd_point5) 2180 #define STBIR_max_uint8_as_floatX STBIR__CONSTF(STBIR_max_uint8_as_float) 2181 #define STBIR_max_uint16_as_floatX STBIR__CONSTF(STBIR_max_uint16_as_float) 2182 #define stbir__simdfX_float_count 4 2183 #define stbir__if_simdf8_cast_to_simdf4( val ) ( val ) 2184 #define stbir__simdfX_0123to1230 stbir__simdf_0123to1230 2185 #define stbir__simdfX_0123to2103 stbir__simdf_0123to2103 2186 #endif 2187 2188 2189 #if defined(STBIR_NEON) && !defined(_M_ARM) && !defined(__arm__) 2190 2191 #if defined( _MSC_VER ) && !defined(__clang__) 2192 typedef __int16 stbir__FP16; 2193 #else 2194 typedef float16_t stbir__FP16; 2195 #endif 2196 2197 #else // no NEON, or 32-bit ARM for MSVC 2198 2199 typedef union stbir__FP16 2200 { 2201 unsigned short u; 2202 } stbir__FP16; 2203 2204 #endif 2205 2206 #if (!defined(STBIR_NEON) && !defined(STBIR_FP16C)) || (defined(STBIR_NEON) && defined(_M_ARM)) || (defined(STBIR_NEON) && defined(__arm__)) 2207 2208 // Fabian's half float routines, see: https://gist.github.com/rygorous/2156668 2209 2210 static stbir__inline float stbir__half_to_float( stbir__FP16 h ) 2211 { 2212 static const stbir__FP32 magic = { (254 - 15) << 23 }; 2213 static const stbir__FP32 was_infnan = { (127 + 16) << 23 }; 2214 stbir__FP32 o; 2215 2216 o.u = (h.u & 0x7fff) << 13; // exponent/mantissa bits 2217 o.f *= magic.f; // exponent adjust 2218 if (o.f >= was_infnan.f) // make sure Inf/NaN survive 2219 o.u |= 255 << 23; 2220 o.u |= (h.u & 0x8000) << 16; // sign bit 2221 return o.f; 2222 } 2223 2224 static stbir__inline stbir__FP16 stbir__float_to_half(float val) 2225 { 2226 stbir__FP32 f32infty = { 255 << 23 }; 2227 stbir__FP32 f16max = { (127 + 16) << 23 }; 2228 stbir__FP32 denorm_magic = { ((127 - 15) + (23 - 10) + 1) << 23 }; 2229 unsigned int sign_mask = 0x80000000u; 2230 stbir__FP16 o = { 0 }; 2231 stbir__FP32 f; 2232 unsigned int sign; 2233 2234 f.f = val; 2235 sign = f.u & sign_mask; 2236 f.u ^= sign; 2237 2238 if (f.u >= f16max.u) // result is Inf or NaN (all exponent bits set) 2239 o.u = (f.u > f32infty.u) ? 0x7e00 : 0x7c00; // NaN->qNaN and Inf->Inf 2240 else // (De)normalized number or zero 2241 { 2242 if (f.u < (113 << 23)) // resulting FP16 is subnormal or zero 2243 { 2244 // use a magic value to align our 10 mantissa bits at the bottom of 2245 // the float. as long as FP addition is round-to-nearest-even this 2246 // just works. 2247 f.f += denorm_magic.f; 2248 // and one integer subtract of the bias later, we have our final float! 2249 o.u = (unsigned short) ( f.u - denorm_magic.u ); 2250 } 2251 else 2252 { 2253 unsigned int mant_odd = (f.u >> 13) & 1; // resulting mantissa is odd 2254 // update exponent, rounding bias part 1 2255 f.u = f.u + ((15u - 127) << 23) + 0xfff; 2256 // rounding bias part 2 2257 f.u += mant_odd; 2258 // take the bits! 2259 o.u = (unsigned short) ( f.u >> 13 ); 2260 } 2261 } 2262 2263 o.u |= sign >> 16; 2264 return o; 2265 } 2266 2267 #endif 2268 2269 2270 #if defined(STBIR_FP16C) 2271 2272 #include <immintrin.h> 2273 2274 static stbir__inline void stbir__half_to_float_SIMD(float * output, stbir__FP16 const * input) 2275 { 2276 _mm256_storeu_ps( (float*)output, _mm256_cvtph_ps( _mm_loadu_si128( (__m128i const* )input ) ) ); 2277 } 2278 2279 static stbir__inline void stbir__float_to_half_SIMD(stbir__FP16 * output, float const * input) 2280 { 2281 _mm_storeu_si128( (__m128i*)output, _mm256_cvtps_ph( _mm256_loadu_ps( input ), 0 ) ); 2282 } 2283 2284 static stbir__inline float stbir__half_to_float( stbir__FP16 h ) 2285 { 2286 return _mm_cvtss_f32( _mm_cvtph_ps( _mm_cvtsi32_si128( (int)h.u ) ) ); 2287 } 2288 2289 static stbir__inline stbir__FP16 stbir__float_to_half( float f ) 2290 { 2291 stbir__FP16 h; 2292 h.u = (unsigned short) _mm_cvtsi128_si32( _mm_cvtps_ph( _mm_set_ss( f ), 0 ) ); 2293 return h; 2294 } 2295 2296 #elif defined(STBIR_SSE2) 2297 2298 // Fabian's half float routines, see: https://gist.github.com/rygorous/2156668 2299 stbir__inline static void stbir__half_to_float_SIMD(float * output, void const * input) 2300 { 2301 static const STBIR__SIMDI_CONST(mask_nosign, 0x7fff); 2302 static const STBIR__SIMDI_CONST(smallest_normal, 0x0400); 2303 static const STBIR__SIMDI_CONST(infinity, 0x7c00); 2304 static const STBIR__SIMDI_CONST(expadjust_normal, (127 - 15) << 23); 2305 static const STBIR__SIMDI_CONST(magic_denorm, 113 << 23); 2306 2307 __m128i i = _mm_loadu_si128 ( (__m128i const*)(input) ); 2308 __m128i h = _mm_unpacklo_epi16 ( i, _mm_setzero_si128() ); 2309 __m128i mnosign = STBIR__CONSTI(mask_nosign); 2310 __m128i eadjust = STBIR__CONSTI(expadjust_normal); 2311 __m128i smallest = STBIR__CONSTI(smallest_normal); 2312 __m128i infty = STBIR__CONSTI(infinity); 2313 __m128i expmant = _mm_and_si128(mnosign, h); 2314 __m128i justsign = _mm_xor_si128(h, expmant); 2315 __m128i b_notinfnan = _mm_cmpgt_epi32(infty, expmant); 2316 __m128i b_isdenorm = _mm_cmpgt_epi32(smallest, expmant); 2317 __m128i shifted = _mm_slli_epi32(expmant, 13); 2318 __m128i adj_infnan = _mm_andnot_si128(b_notinfnan, eadjust); 2319 __m128i adjusted = _mm_add_epi32(eadjust, shifted); 2320 __m128i den1 = _mm_add_epi32(shifted, STBIR__CONSTI(magic_denorm)); 2321 __m128i adjusted2 = _mm_add_epi32(adjusted, adj_infnan); 2322 __m128 den2 = _mm_sub_ps(_mm_castsi128_ps(den1), *(const __m128 *)&magic_denorm); 2323 __m128 adjusted3 = _mm_and_ps(den2, _mm_castsi128_ps(b_isdenorm)); 2324 __m128 adjusted4 = _mm_andnot_ps(_mm_castsi128_ps(b_isdenorm), _mm_castsi128_ps(adjusted2)); 2325 __m128 adjusted5 = _mm_or_ps(adjusted3, adjusted4); 2326 __m128i sign = _mm_slli_epi32(justsign, 16); 2327 __m128 final = _mm_or_ps(adjusted5, _mm_castsi128_ps(sign)); 2328 stbir__simdf_store( output + 0, final ); 2329 2330 h = _mm_unpackhi_epi16 ( i, _mm_setzero_si128() ); 2331 expmant = _mm_and_si128(mnosign, h); 2332 justsign = _mm_xor_si128(h, expmant); 2333 b_notinfnan = _mm_cmpgt_epi32(infty, expmant); 2334 b_isdenorm = _mm_cmpgt_epi32(smallest, expmant); 2335 shifted = _mm_slli_epi32(expmant, 13); 2336 adj_infnan = _mm_andnot_si128(b_notinfnan, eadjust); 2337 adjusted = _mm_add_epi32(eadjust, shifted); 2338 den1 = _mm_add_epi32(shifted, STBIR__CONSTI(magic_denorm)); 2339 adjusted2 = _mm_add_epi32(adjusted, adj_infnan); 2340 den2 = _mm_sub_ps(_mm_castsi128_ps(den1), *(const __m128 *)&magic_denorm); 2341 adjusted3 = _mm_and_ps(den2, _mm_castsi128_ps(b_isdenorm)); 2342 adjusted4 = _mm_andnot_ps(_mm_castsi128_ps(b_isdenorm), _mm_castsi128_ps(adjusted2)); 2343 adjusted5 = _mm_or_ps(adjusted3, adjusted4); 2344 sign = _mm_slli_epi32(justsign, 16); 2345 final = _mm_or_ps(adjusted5, _mm_castsi128_ps(sign)); 2346 stbir__simdf_store( output + 4, final ); 2347 2348 // ~38 SSE2 ops for 8 values 2349 } 2350 2351 // Fabian's round-to-nearest-even float to half 2352 // ~48 SSE2 ops for 8 output 2353 stbir__inline static void stbir__float_to_half_SIMD(void * output, float const * input) 2354 { 2355 static const STBIR__SIMDI_CONST(mask_sign, 0x80000000u); 2356 static const STBIR__SIMDI_CONST(c_f16max, (127 + 16) << 23); // all FP32 values >=this round to +inf 2357 static const STBIR__SIMDI_CONST(c_nanbit, 0x200); 2358 static const STBIR__SIMDI_CONST(c_infty_as_fp16, 0x7c00); 2359 static const STBIR__SIMDI_CONST(c_min_normal, (127 - 14) << 23); // smallest FP32 that yields a normalized FP16 2360 static const STBIR__SIMDI_CONST(c_subnorm_magic, ((127 - 15) + (23 - 10) + 1) << 23); 2361 static const STBIR__SIMDI_CONST(c_normal_bias, 0xfff - ((127 - 15) << 23)); // adjust exponent and add mantissa rounding 2362 2363 __m128 f = _mm_loadu_ps(input); 2364 __m128 msign = _mm_castsi128_ps(STBIR__CONSTI(mask_sign)); 2365 __m128 justsign = _mm_and_ps(msign, f); 2366 __m128 absf = _mm_xor_ps(f, justsign); 2367 __m128i absf_int = _mm_castps_si128(absf); // the cast is "free" (extra bypass latency, but no thruput hit) 2368 __m128i f16max = STBIR__CONSTI(c_f16max); 2369 __m128 b_isnan = _mm_cmpunord_ps(absf, absf); // is this a NaN? 2370 __m128i b_isregular = _mm_cmpgt_epi32(f16max, absf_int); // (sub)normalized or special? 2371 __m128i nanbit = _mm_and_si128(_mm_castps_si128(b_isnan), STBIR__CONSTI(c_nanbit)); 2372 __m128i inf_or_nan = _mm_or_si128(nanbit, STBIR__CONSTI(c_infty_as_fp16)); // output for specials 2373 2374 __m128i min_normal = STBIR__CONSTI(c_min_normal); 2375 __m128i b_issub = _mm_cmpgt_epi32(min_normal, absf_int); 2376 2377 // "result is subnormal" path 2378 __m128 subnorm1 = _mm_add_ps(absf, _mm_castsi128_ps(STBIR__CONSTI(c_subnorm_magic))); // magic value to round output mantissa 2379 __m128i subnorm2 = _mm_sub_epi32(_mm_castps_si128(subnorm1), STBIR__CONSTI(c_subnorm_magic)); // subtract out bias 2380 2381 // "result is normal" path 2382 __m128i mantoddbit = _mm_slli_epi32(absf_int, 31 - 13); // shift bit 13 (mantissa LSB) to sign 2383 __m128i mantodd = _mm_srai_epi32(mantoddbit, 31); // -1 if FP16 mantissa odd, else 0 2384 2385 __m128i round1 = _mm_add_epi32(absf_int, STBIR__CONSTI(c_normal_bias)); 2386 __m128i round2 = _mm_sub_epi32(round1, mantodd); // if mantissa LSB odd, bias towards rounding up (RTNE) 2387 __m128i normal = _mm_srli_epi32(round2, 13); // rounded result 2388 2389 // combine the two non-specials 2390 __m128i nonspecial = _mm_or_si128(_mm_and_si128(subnorm2, b_issub), _mm_andnot_si128(b_issub, normal)); 2391 2392 // merge in specials as well 2393 __m128i joined = _mm_or_si128(_mm_and_si128(nonspecial, b_isregular), _mm_andnot_si128(b_isregular, inf_or_nan)); 2394 2395 __m128i sign_shift = _mm_srai_epi32(_mm_castps_si128(justsign), 16); 2396 __m128i final2, final= _mm_or_si128(joined, sign_shift); 2397 2398 f = _mm_loadu_ps(input+4); 2399 justsign = _mm_and_ps(msign, f); 2400 absf = _mm_xor_ps(f, justsign); 2401 absf_int = _mm_castps_si128(absf); // the cast is "free" (extra bypass latency, but no thruput hit) 2402 b_isnan = _mm_cmpunord_ps(absf, absf); // is this a NaN? 2403 b_isregular = _mm_cmpgt_epi32(f16max, absf_int); // (sub)normalized or special? 2404 nanbit = _mm_and_si128(_mm_castps_si128(b_isnan), c_nanbit); 2405 inf_or_nan = _mm_or_si128(nanbit, STBIR__CONSTI(c_infty_as_fp16)); // output for specials 2406 2407 b_issub = _mm_cmpgt_epi32(min_normal, absf_int); 2408 2409 // "result is subnormal" path 2410 subnorm1 = _mm_add_ps(absf, _mm_castsi128_ps(STBIR__CONSTI(c_subnorm_magic))); // magic value to round output mantissa 2411 subnorm2 = _mm_sub_epi32(_mm_castps_si128(subnorm1), STBIR__CONSTI(c_subnorm_magic)); // subtract out bias 2412 2413 // "result is normal" path 2414 mantoddbit = _mm_slli_epi32(absf_int, 31 - 13); // shift bit 13 (mantissa LSB) to sign 2415 mantodd = _mm_srai_epi32(mantoddbit, 31); // -1 if FP16 mantissa odd, else 0 2416 2417 round1 = _mm_add_epi32(absf_int, STBIR__CONSTI(c_normal_bias)); 2418 round2 = _mm_sub_epi32(round1, mantodd); // if mantissa LSB odd, bias towards rounding up (RTNE) 2419 normal = _mm_srli_epi32(round2, 13); // rounded result 2420 2421 // combine the two non-specials 2422 nonspecial = _mm_or_si128(_mm_and_si128(subnorm2, b_issub), _mm_andnot_si128(b_issub, normal)); 2423 2424 // merge in specials as well 2425 joined = _mm_or_si128(_mm_and_si128(nonspecial, b_isregular), _mm_andnot_si128(b_isregular, inf_or_nan)); 2426 2427 sign_shift = _mm_srai_epi32(_mm_castps_si128(justsign), 16); 2428 final2 = _mm_or_si128(joined, sign_shift); 2429 final = _mm_packs_epi32(final, final2); 2430 stbir__simdi_store( output,final ); 2431 } 2432 2433 #elif defined(STBIR_NEON) && defined(_MSC_VER) && defined(_M_ARM64) && !defined(__clang__) // 64-bit ARM on MSVC (not clang) 2434 2435 static stbir__inline void stbir__half_to_float_SIMD(float * output, stbir__FP16 const * input) 2436 { 2437 float16x4_t in0 = vld1_f16(input + 0); 2438 float16x4_t in1 = vld1_f16(input + 4); 2439 vst1q_f32(output + 0, vcvt_f32_f16(in0)); 2440 vst1q_f32(output + 4, vcvt_f32_f16(in1)); 2441 } 2442 2443 static stbir__inline void stbir__float_to_half_SIMD(stbir__FP16 * output, float const * input) 2444 { 2445 float16x4_t out0 = vcvt_f16_f32(vld1q_f32(input + 0)); 2446 float16x4_t out1 = vcvt_f16_f32(vld1q_f32(input + 4)); 2447 vst1_f16(output+0, out0); 2448 vst1_f16(output+4, out1); 2449 } 2450 2451 static stbir__inline float stbir__half_to_float( stbir__FP16 h ) 2452 { 2453 return vgetq_lane_f32(vcvt_f32_f16(vld1_dup_f16(&h)), 0); 2454 } 2455 2456 static stbir__inline stbir__FP16 stbir__float_to_half( float f ) 2457 { 2458 return vget_lane_f16(vcvt_f16_f32(vdupq_n_f32(f)), 0).n16_u16[0]; 2459 } 2460 2461 #elif defined(STBIR_NEON) && ( defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ ) ) // 64-bit ARM 2462 2463 static stbir__inline void stbir__half_to_float_SIMD(float * output, stbir__FP16 const * input) 2464 { 2465 float16x8_t in = vld1q_f16(input); 2466 vst1q_f32(output + 0, vcvt_f32_f16(vget_low_f16(in))); 2467 vst1q_f32(output + 4, vcvt_f32_f16(vget_high_f16(in))); 2468 } 2469 2470 static stbir__inline void stbir__float_to_half_SIMD(stbir__FP16 * output, float const * input) 2471 { 2472 float16x4_t out0 = vcvt_f16_f32(vld1q_f32(input + 0)); 2473 float16x4_t out1 = vcvt_f16_f32(vld1q_f32(input + 4)); 2474 vst1q_f16(output, vcombine_f16(out0, out1)); 2475 } 2476 2477 static stbir__inline float stbir__half_to_float( stbir__FP16 h ) 2478 { 2479 return vgetq_lane_f32(vcvt_f32_f16(vdup_n_f16(h)), 0); 2480 } 2481 2482 static stbir__inline stbir__FP16 stbir__float_to_half( float f ) 2483 { 2484 return vget_lane_f16(vcvt_f16_f32(vdupq_n_f32(f)), 0); 2485 } 2486 2487 #elif defined(STBIR_WASM) || (defined(STBIR_NEON) && (defined(_MSC_VER) || defined(_M_ARM) || defined(__arm__))) // WASM or 32-bit ARM on MSVC/clang 2488 2489 static stbir__inline void stbir__half_to_float_SIMD(float * output, stbir__FP16 const * input) 2490 { 2491 for (int i=0; i<8; i++) 2492 { 2493 output[i] = stbir__half_to_float(input[i]); 2494 } 2495 } 2496 static stbir__inline void stbir__float_to_half_SIMD(stbir__FP16 * output, float const * input) 2497 { 2498 for (int i=0; i<8; i++) 2499 { 2500 output[i] = stbir__float_to_half(input[i]); 2501 } 2502 } 2503 2504 #endif 2505 2506 2507 #ifdef STBIR_SIMD 2508 2509 #define stbir__simdf_0123to3333( out, reg ) (out) = stbir__simdf_swiz( reg, 3,3,3,3 ) 2510 #define stbir__simdf_0123to2222( out, reg ) (out) = stbir__simdf_swiz( reg, 2,2,2,2 ) 2511 #define stbir__simdf_0123to1111( out, reg ) (out) = stbir__simdf_swiz( reg, 1,1,1,1 ) 2512 #define stbir__simdf_0123to0000( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,0,0 ) 2513 #define stbir__simdf_0123to0003( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,0,3 ) 2514 #define stbir__simdf_0123to0001( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,0,1 ) 2515 #define stbir__simdf_0123to1122( out, reg ) (out) = stbir__simdf_swiz( reg, 1,1,2,2 ) 2516 #define stbir__simdf_0123to2333( out, reg ) (out) = stbir__simdf_swiz( reg, 2,3,3,3 ) 2517 #define stbir__simdf_0123to0023( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,2,3 ) 2518 #define stbir__simdf_0123to1230( out, reg ) (out) = stbir__simdf_swiz( reg, 1,2,3,0 ) 2519 #define stbir__simdf_0123to2103( out, reg ) (out) = stbir__simdf_swiz( reg, 2,1,0,3 ) 2520 #define stbir__simdf_0123to3210( out, reg ) (out) = stbir__simdf_swiz( reg, 3,2,1,0 ) 2521 #define stbir__simdf_0123to2301( out, reg ) (out) = stbir__simdf_swiz( reg, 2,3,0,1 ) 2522 #define stbir__simdf_0123to3012( out, reg ) (out) = stbir__simdf_swiz( reg, 3,0,1,2 ) 2523 #define stbir__simdf_0123to0011( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,1,1 ) 2524 #define stbir__simdf_0123to1100( out, reg ) (out) = stbir__simdf_swiz( reg, 1,1,0,0 ) 2525 #define stbir__simdf_0123to2233( out, reg ) (out) = stbir__simdf_swiz( reg, 2,2,3,3 ) 2526 #define stbir__simdf_0123to1133( out, reg ) (out) = stbir__simdf_swiz( reg, 1,1,3,3 ) 2527 #define stbir__simdf_0123to0022( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,2,2 ) 2528 #define stbir__simdf_0123to1032( out, reg ) (out) = stbir__simdf_swiz( reg, 1,0,3,2 ) 2529 2530 typedef union stbir__simdi_u32 2531 { 2532 stbir_uint32 m128i_u32[4]; 2533 int m128i_i32[4]; 2534 stbir__simdi m128i_i128; 2535 } stbir__simdi_u32; 2536 2537 static const int STBIR_mask[9] = { 0,0,0,-1,-1,-1,0,0,0 }; 2538 2539 static const STBIR__SIMDF_CONST(STBIR_max_uint8_as_float, stbir__max_uint8_as_float); 2540 static const STBIR__SIMDF_CONST(STBIR_max_uint16_as_float, stbir__max_uint16_as_float); 2541 static const STBIR__SIMDF_CONST(STBIR_max_uint8_as_float_inverted, stbir__max_uint8_as_float_inverted); 2542 static const STBIR__SIMDF_CONST(STBIR_max_uint16_as_float_inverted, stbir__max_uint16_as_float_inverted); 2543 2544 static const STBIR__SIMDF_CONST(STBIR_simd_point5, 0.5f); 2545 static const STBIR__SIMDF_CONST(STBIR_ones, 1.0f); 2546 static const STBIR__SIMDI_CONST(STBIR_almost_zero, (127 - 13) << 23); 2547 static const STBIR__SIMDI_CONST(STBIR_almost_one, 0x3f7fffff); 2548 static const STBIR__SIMDI_CONST(STBIR_mastissa_mask, 0xff); 2549 static const STBIR__SIMDI_CONST(STBIR_topscale, 0x02000000); 2550 2551 // Basically, in simd mode, we unroll the proper amount, and we don't want 2552 // the non-simd remnant loops to be unroll because they only run a few times 2553 // Adding this switch saves about 5K on clang which is Captain Unroll the 3rd. 2554 #define STBIR_SIMD_STREAMOUT_PTR( star ) STBIR_STREAMOUT_PTR( star ) 2555 #define STBIR_SIMD_NO_UNROLL(ptr) STBIR_NO_UNROLL(ptr) 2556 #define STBIR_SIMD_NO_UNROLL_LOOP_START STBIR_NO_UNROLL_LOOP_START 2557 #define STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR STBIR_NO_UNROLL_LOOP_START_INF_FOR 2558 2559 #ifdef STBIR_MEMCPY 2560 #undef STBIR_MEMCPY 2561 #endif 2562 #define STBIR_MEMCPY stbir_simd_memcpy 2563 2564 // override normal use of memcpy with much simpler copy (faster and smaller with our sized copies) 2565 static void stbir_simd_memcpy( void * dest, void const * src, size_t bytes ) 2566 { 2567 char STBIR_SIMD_STREAMOUT_PTR (*) d = (char*) dest; 2568 char STBIR_SIMD_STREAMOUT_PTR( * ) d_end = ((char*) dest) + bytes; 2569 ptrdiff_t ofs_to_src = (char*)src - (char*)dest; 2570 2571 // check overlaps 2572 STBIR_ASSERT( ( ( d >= ( (char*)src) + bytes ) ) || ( ( d + bytes ) <= (char*)src ) ); 2573 2574 if ( bytes < (16*stbir__simdfX_float_count) ) 2575 { 2576 if ( bytes < 16 ) 2577 { 2578 if ( bytes ) 2579 { 2580 STBIR_SIMD_NO_UNROLL_LOOP_START 2581 do 2582 { 2583 STBIR_SIMD_NO_UNROLL(d); 2584 d[ 0 ] = d[ ofs_to_src ]; 2585 ++d; 2586 } while ( d < d_end ); 2587 } 2588 } 2589 else 2590 { 2591 stbir__simdf x; 2592 // do one unaligned to get us aligned for the stream out below 2593 stbir__simdf_load( x, ( d + ofs_to_src ) ); 2594 stbir__simdf_store( d, x ); 2595 d = (char*)( ( ( (size_t)d ) + 16 ) & ~15 ); 2596 2597 STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR 2598 for(;;) 2599 { 2600 STBIR_SIMD_NO_UNROLL(d); 2601 2602 if ( d > ( d_end - 16 ) ) 2603 { 2604 if ( d == d_end ) 2605 return; 2606 d = d_end - 16; 2607 } 2608 2609 stbir__simdf_load( x, ( d + ofs_to_src ) ); 2610 stbir__simdf_store( d, x ); 2611 d += 16; 2612 } 2613 } 2614 } 2615 else 2616 { 2617 stbir__simdfX x0,x1,x2,x3; 2618 2619 // do one unaligned to get us aligned for the stream out below 2620 stbir__simdfX_load( x0, ( d + ofs_to_src ) + 0*stbir__simdfX_float_count ); 2621 stbir__simdfX_load( x1, ( d + ofs_to_src ) + 4*stbir__simdfX_float_count ); 2622 stbir__simdfX_load( x2, ( d + ofs_to_src ) + 8*stbir__simdfX_float_count ); 2623 stbir__simdfX_load( x3, ( d + ofs_to_src ) + 12*stbir__simdfX_float_count ); 2624 stbir__simdfX_store( d + 0*stbir__simdfX_float_count, x0 ); 2625 stbir__simdfX_store( d + 4*stbir__simdfX_float_count, x1 ); 2626 stbir__simdfX_store( d + 8*stbir__simdfX_float_count, x2 ); 2627 stbir__simdfX_store( d + 12*stbir__simdfX_float_count, x3 ); 2628 d = (char*)( ( ( (size_t)d ) + (16*stbir__simdfX_float_count) ) & ~((16*stbir__simdfX_float_count)-1) ); 2629 2630 STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR 2631 for(;;) 2632 { 2633 STBIR_SIMD_NO_UNROLL(d); 2634 2635 if ( d > ( d_end - (16*stbir__simdfX_float_count) ) ) 2636 { 2637 if ( d == d_end ) 2638 return; 2639 d = d_end - (16*stbir__simdfX_float_count); 2640 } 2641 2642 stbir__simdfX_load( x0, ( d + ofs_to_src ) + 0*stbir__simdfX_float_count ); 2643 stbir__simdfX_load( x1, ( d + ofs_to_src ) + 4*stbir__simdfX_float_count ); 2644 stbir__simdfX_load( x2, ( d + ofs_to_src ) + 8*stbir__simdfX_float_count ); 2645 stbir__simdfX_load( x3, ( d + ofs_to_src ) + 12*stbir__simdfX_float_count ); 2646 stbir__simdfX_store( d + 0*stbir__simdfX_float_count, x0 ); 2647 stbir__simdfX_store( d + 4*stbir__simdfX_float_count, x1 ); 2648 stbir__simdfX_store( d + 8*stbir__simdfX_float_count, x2 ); 2649 stbir__simdfX_store( d + 12*stbir__simdfX_float_count, x3 ); 2650 d += (16*stbir__simdfX_float_count); 2651 } 2652 } 2653 } 2654 2655 // memcpy that is specically intentionally overlapping (src is smaller then dest, so can be 2656 // a normal forward copy, bytes is divisible by 4 and bytes is greater than or equal to 2657 // the diff between dest and src) 2658 static void stbir_overlapping_memcpy( void * dest, void const * src, size_t bytes ) 2659 { 2660 char STBIR_SIMD_STREAMOUT_PTR (*) sd = (char*) src; 2661 char STBIR_SIMD_STREAMOUT_PTR( * ) s_end = ((char*) src) + bytes; 2662 ptrdiff_t ofs_to_dest = (char*)dest - (char*)src; 2663 2664 if ( ofs_to_dest >= 16 ) // is the overlap more than 16 away? 2665 { 2666 char STBIR_SIMD_STREAMOUT_PTR( * ) s_end16 = ((char*) src) + (bytes&~15); 2667 STBIR_SIMD_NO_UNROLL_LOOP_START 2668 do 2669 { 2670 stbir__simdf x; 2671 STBIR_SIMD_NO_UNROLL(sd); 2672 stbir__simdf_load( x, sd ); 2673 stbir__simdf_store( ( sd + ofs_to_dest ), x ); 2674 sd += 16; 2675 } while ( sd < s_end16 ); 2676 2677 if ( sd == s_end ) 2678 return; 2679 } 2680 2681 do 2682 { 2683 STBIR_SIMD_NO_UNROLL(sd); 2684 *(int*)( sd + ofs_to_dest ) = *(int*) sd; 2685 sd += 4; 2686 } while ( sd < s_end ); 2687 } 2688 2689 #else // no SSE2 2690 2691 // when in scalar mode, we let unrolling happen, so this macro just does the __restrict 2692 #define STBIR_SIMD_STREAMOUT_PTR( star ) STBIR_STREAMOUT_PTR( star ) 2693 #define STBIR_SIMD_NO_UNROLL(ptr) 2694 #define STBIR_SIMD_NO_UNROLL_LOOP_START 2695 #define STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR 2696 2697 #endif // SSE2 2698 2699 2700 #ifdef STBIR_PROFILE 2701 2702 #ifndef STBIR_PROFILE_FUNC 2703 2704 #if defined(_x86_64) || defined( __x86_64__ ) || defined( _M_X64 ) || defined(__x86_64) || defined(__SSE2__) || defined(STBIR_SSE) || defined( _M_IX86_FP ) || defined(__i386) || defined( __i386__ ) || defined( _M_IX86 ) || defined( _X86_ ) 2705 2706 #ifdef _MSC_VER 2707 2708 STBIRDEF stbir_uint64 __rdtsc(); 2709 #define STBIR_PROFILE_FUNC() __rdtsc() 2710 2711 #else // non msvc 2712 2713 static stbir__inline stbir_uint64 STBIR_PROFILE_FUNC() 2714 { 2715 stbir_uint32 lo, hi; 2716 asm volatile ("rdtsc" : "=a" (lo), "=d" (hi) ); 2717 return ( ( (stbir_uint64) hi ) << 32 ) | ( (stbir_uint64) lo ); 2718 } 2719 2720 #endif // msvc 2721 2722 #elif defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ ) || defined(__ARM_NEON__) 2723 2724 #if defined( _MSC_VER ) && !defined(__clang__) 2725 2726 #define STBIR_PROFILE_FUNC() _ReadStatusReg(ARM64_CNTVCT) 2727 2728 #else 2729 2730 static stbir__inline stbir_uint64 STBIR_PROFILE_FUNC() 2731 { 2732 stbir_uint64 tsc; 2733 asm volatile("mrs %0, cntvct_el0" : "=r" (tsc)); 2734 return tsc; 2735 } 2736 2737 #endif 2738 2739 #else // x64, arm 2740 2741 #error Unknown platform for profiling. 2742 2743 #endif // x64, arm 2744 2745 #endif // STBIR_PROFILE_FUNC 2746 2747 #define STBIR_ONLY_PROFILE_GET_SPLIT_INFO ,stbir__per_split_info * split_info 2748 #define STBIR_ONLY_PROFILE_SET_SPLIT_INFO ,split_info 2749 2750 #define STBIR_ONLY_PROFILE_BUILD_GET_INFO ,stbir__info * profile_info 2751 #define STBIR_ONLY_PROFILE_BUILD_SET_INFO ,profile_info 2752 2753 // super light-weight micro profiler 2754 #define STBIR_PROFILE_START_ll( info, wh ) { stbir_uint64 wh##thiszonetime = STBIR_PROFILE_FUNC(); stbir_uint64 * wh##save_parent_excluded_ptr = info->current_zone_excluded_ptr; stbir_uint64 wh##current_zone_excluded = 0; info->current_zone_excluded_ptr = &wh##current_zone_excluded; 2755 #define STBIR_PROFILE_END_ll( info, wh ) wh##thiszonetime = STBIR_PROFILE_FUNC() - wh##thiszonetime; info->profile.named.wh += wh##thiszonetime - wh##current_zone_excluded; *wh##save_parent_excluded_ptr += wh##thiszonetime; info->current_zone_excluded_ptr = wh##save_parent_excluded_ptr; } 2756 #define STBIR_PROFILE_FIRST_START_ll( info, wh ) { int i; info->current_zone_excluded_ptr = &info->profile.named.total; for(i=0;i<STBIR__ARRAY_SIZE(info->profile.array);i++) info->profile.array[i]=0; } STBIR_PROFILE_START_ll( info, wh ); 2757 #define STBIR_PROFILE_CLEAR_EXTRAS_ll( info, num ) { int extra; for(extra=1;extra<(num);extra++) { int i; for(i=0;i<STBIR__ARRAY_SIZE((info)->profile.array);i++) (info)[extra].profile.array[i]=0; } } 2758 2759 // for thread data 2760 #define STBIR_PROFILE_START( wh ) STBIR_PROFILE_START_ll( split_info, wh ) 2761 #define STBIR_PROFILE_END( wh ) STBIR_PROFILE_END_ll( split_info, wh ) 2762 #define STBIR_PROFILE_FIRST_START( wh ) STBIR_PROFILE_FIRST_START_ll( split_info, wh ) 2763 #define STBIR_PROFILE_CLEAR_EXTRAS() STBIR_PROFILE_CLEAR_EXTRAS_ll( split_info, split_count ) 2764 2765 // for build data 2766 #define STBIR_PROFILE_BUILD_START( wh ) STBIR_PROFILE_START_ll( profile_info, wh ) 2767 #define STBIR_PROFILE_BUILD_END( wh ) STBIR_PROFILE_END_ll( profile_info, wh ) 2768 #define STBIR_PROFILE_BUILD_FIRST_START( wh ) STBIR_PROFILE_FIRST_START_ll( profile_info, wh ) 2769 #define STBIR_PROFILE_BUILD_CLEAR( info ) { int i; for(i=0;i<STBIR__ARRAY_SIZE(info->profile.array);i++) info->profile.array[i]=0; } 2770 2771 #else // no profile 2772 2773 #define STBIR_ONLY_PROFILE_GET_SPLIT_INFO 2774 #define STBIR_ONLY_PROFILE_SET_SPLIT_INFO 2775 2776 #define STBIR_ONLY_PROFILE_BUILD_GET_INFO 2777 #define STBIR_ONLY_PROFILE_BUILD_SET_INFO 2778 2779 #define STBIR_PROFILE_START( wh ) 2780 #define STBIR_PROFILE_END( wh ) 2781 #define STBIR_PROFILE_FIRST_START( wh ) 2782 #define STBIR_PROFILE_CLEAR_EXTRAS( ) 2783 2784 #define STBIR_PROFILE_BUILD_START( wh ) 2785 #define STBIR_PROFILE_BUILD_END( wh ) 2786 #define STBIR_PROFILE_BUILD_FIRST_START( wh ) 2787 #define STBIR_PROFILE_BUILD_CLEAR( info ) 2788 2789 #endif // stbir_profile 2790 2791 #ifndef STBIR_CEILF 2792 #include <math.h> 2793 #if _MSC_VER <= 1200 // support VC6 for Sean 2794 #define STBIR_CEILF(x) ((float)ceil((float)(x))) 2795 #define STBIR_FLOORF(x) ((float)floor((float)(x))) 2796 #else 2797 #define STBIR_CEILF(x) ceilf(x) 2798 #define STBIR_FLOORF(x) floorf(x) 2799 #endif 2800 #endif 2801 2802 #ifndef STBIR_MEMCPY 2803 // For memcpy 2804 #include <string.h> 2805 #define STBIR_MEMCPY( dest, src, len ) memcpy( dest, src, len ) 2806 #endif 2807 2808 #ifndef STBIR_SIMD 2809 2810 // memcpy that is specifically intentionally overlapping (src is smaller then dest, so can be 2811 // a normal forward copy, bytes is divisible by 4 and bytes is greater than or equal to 2812 // the diff between dest and src) 2813 static void stbir_overlapping_memcpy( void * dest, void const * src, size_t bytes ) 2814 { 2815 char STBIR_SIMD_STREAMOUT_PTR (*) sd = (char*) src; 2816 char STBIR_SIMD_STREAMOUT_PTR( * ) s_end = ((char*) src) + bytes; 2817 ptrdiff_t ofs_to_dest = (char*)dest - (char*)src; 2818 2819 if ( ofs_to_dest >= 8 ) // is the overlap more than 8 away? 2820 { 2821 char STBIR_SIMD_STREAMOUT_PTR( * ) s_end8 = ((char*) src) + (bytes&~7); 2822 STBIR_NO_UNROLL_LOOP_START 2823 do 2824 { 2825 STBIR_NO_UNROLL(sd); 2826 *(stbir_uint64*)( sd + ofs_to_dest ) = *(stbir_uint64*) sd; 2827 sd += 8; 2828 } while ( sd < s_end8 ); 2829 2830 if ( sd == s_end ) 2831 return; 2832 } 2833 2834 STBIR_NO_UNROLL_LOOP_START 2835 do 2836 { 2837 STBIR_NO_UNROLL(sd); 2838 *(int*)( sd + ofs_to_dest ) = *(int*) sd; 2839 sd += 4; 2840 } while ( sd < s_end ); 2841 } 2842 2843 #endif 2844 2845 static float stbir__filter_trapezoid(float x, float scale, void * user_data) 2846 { 2847 float halfscale = scale / 2; 2848 float t = 0.5f + halfscale; 2849 STBIR_ASSERT(scale <= 1); 2850 STBIR__UNUSED(user_data); 2851 2852 if ( x < 0.0f ) x = -x; 2853 2854 if (x >= t) 2855 return 0.0f; 2856 else 2857 { 2858 float r = 0.5f - halfscale; 2859 if (x <= r) 2860 return 1.0f; 2861 else 2862 return (t - x) / scale; 2863 } 2864 } 2865 2866 static float stbir__support_trapezoid(float scale, void * user_data) 2867 { 2868 STBIR__UNUSED(user_data); 2869 return 0.5f + scale / 2.0f; 2870 } 2871 2872 static float stbir__filter_triangle(float x, float s, void * user_data) 2873 { 2874 STBIR__UNUSED(s); 2875 STBIR__UNUSED(user_data); 2876 2877 if ( x < 0.0f ) x = -x; 2878 2879 if (x <= 1.0f) 2880 return 1.0f - x; 2881 else 2882 return 0.0f; 2883 } 2884 2885 static float stbir__filter_point(float x, float s, void * user_data) 2886 { 2887 STBIR__UNUSED(x); 2888 STBIR__UNUSED(s); 2889 STBIR__UNUSED(user_data); 2890 2891 return 1.0f; 2892 } 2893 2894 static float stbir__filter_cubic(float x, float s, void * user_data) 2895 { 2896 STBIR__UNUSED(s); 2897 STBIR__UNUSED(user_data); 2898 2899 if ( x < 0.0f ) x = -x; 2900 2901 if (x < 1.0f) 2902 return (4.0f + x*x*(3.0f*x - 6.0f))/6.0f; 2903 else if (x < 2.0f) 2904 return (8.0f + x*(-12.0f + x*(6.0f - x)))/6.0f; 2905 2906 return (0.0f); 2907 } 2908 2909 static float stbir__filter_catmullrom(float x, float s, void * user_data) 2910 { 2911 STBIR__UNUSED(s); 2912 STBIR__UNUSED(user_data); 2913 2914 if ( x < 0.0f ) x = -x; 2915 2916 if (x < 1.0f) 2917 return 1.0f - x*x*(2.5f - 1.5f*x); 2918 else if (x < 2.0f) 2919 return 2.0f - x*(4.0f + x*(0.5f*x - 2.5f)); 2920 2921 return (0.0f); 2922 } 2923 2924 static float stbir__filter_mitchell(float x, float s, void * user_data) 2925 { 2926 STBIR__UNUSED(s); 2927 STBIR__UNUSED(user_data); 2928 2929 if ( x < 0.0f ) x = -x; 2930 2931 if (x < 1.0f) 2932 return (16.0f + x*x*(21.0f * x - 36.0f))/18.0f; 2933 else if (x < 2.0f) 2934 return (32.0f + x*(-60.0f + x*(36.0f - 7.0f*x)))/18.0f; 2935 2936 return (0.0f); 2937 } 2938 2939 static float stbir__support_zeropoint5(float s, void * user_data) 2940 { 2941 STBIR__UNUSED(s); 2942 STBIR__UNUSED(user_data); 2943 return 0.5f; 2944 } 2945 2946 static float stbir__support_one(float s, void * user_data) 2947 { 2948 STBIR__UNUSED(s); 2949 STBIR__UNUSED(user_data); 2950 return 1; 2951 } 2952 2953 static float stbir__support_two(float s, void * user_data) 2954 { 2955 STBIR__UNUSED(s); 2956 STBIR__UNUSED(user_data); 2957 return 2; 2958 } 2959 2960 // This is the maximum number of input samples that can affect an output sample 2961 // with the given filter from the output pixel's perspective 2962 static int stbir__get_filter_pixel_width(stbir__support_callback * support, float scale, void * user_data) 2963 { 2964 STBIR_ASSERT(support != 0); 2965 2966 if ( scale >= ( 1.0f-stbir__small_float ) ) // upscale 2967 return (int)STBIR_CEILF(support(1.0f/scale,user_data) * 2.0f); 2968 else 2969 return (int)STBIR_CEILF(support(scale,user_data) * 2.0f / scale); 2970 } 2971 2972 // this is how many coefficents per run of the filter (which is different 2973 // from the filter_pixel_width depending on if we are scattering or gathering) 2974 static int stbir__get_coefficient_width(stbir__sampler * samp, int is_gather, void * user_data) 2975 { 2976 float scale = samp->scale_info.scale; 2977 stbir__support_callback * support = samp->filter_support; 2978 2979 switch( is_gather ) 2980 { 2981 case 1: 2982 return (int)STBIR_CEILF(support(1.0f / scale, user_data) * 2.0f); 2983 case 2: 2984 return (int)STBIR_CEILF(support(scale, user_data) * 2.0f / scale); 2985 case 0: 2986 return (int)STBIR_CEILF(support(scale, user_data) * 2.0f); 2987 default: 2988 STBIR_ASSERT( (is_gather >= 0 ) && (is_gather <= 2 ) ); 2989 return 0; 2990 } 2991 } 2992 2993 static int stbir__get_contributors(stbir__sampler * samp, int is_gather) 2994 { 2995 if (is_gather) 2996 return samp->scale_info.output_sub_size; 2997 else 2998 return (samp->scale_info.input_full_size + samp->filter_pixel_margin * 2); 2999 } 3000 3001 static int stbir__edge_zero_full( int n, int max ) 3002 { 3003 STBIR__UNUSED(n); 3004 STBIR__UNUSED(max); 3005 return 0; // NOTREACHED 3006 } 3007 3008 static int stbir__edge_clamp_full( int n, int max ) 3009 { 3010 if (n < 0) 3011 return 0; 3012 3013 if (n >= max) 3014 return max - 1; 3015 3016 return n; // NOTREACHED 3017 } 3018 3019 static int stbir__edge_reflect_full( int n, int max ) 3020 { 3021 if (n < 0) 3022 { 3023 if (n > -max) 3024 return -n; 3025 else 3026 return max - 1; 3027 } 3028 3029 if (n >= max) 3030 { 3031 int max2 = max * 2; 3032 if (n >= max2) 3033 return 0; 3034 else 3035 return max2 - n - 1; 3036 } 3037 3038 return n; // NOTREACHED 3039 } 3040 3041 static int stbir__edge_wrap_full( int n, int max ) 3042 { 3043 if (n >= 0) 3044 return (n % max); 3045 else 3046 { 3047 int m = (-n) % max; 3048 3049 if (m != 0) 3050 m = max - m; 3051 3052 return (m); 3053 } 3054 } 3055 3056 typedef int stbir__edge_wrap_func( int n, int max ); 3057 static stbir__edge_wrap_func * stbir__edge_wrap_slow[] = 3058 { 3059 stbir__edge_clamp_full, // STBIR_EDGE_CLAMP 3060 stbir__edge_reflect_full, // STBIR_EDGE_REFLECT 3061 stbir__edge_wrap_full, // STBIR_EDGE_WRAP 3062 stbir__edge_zero_full, // STBIR_EDGE_ZERO 3063 }; 3064 3065 stbir__inline static int stbir__edge_wrap(stbir_edge edge, int n, int max) 3066 { 3067 // avoid per-pixel switch 3068 if (n >= 0 && n < max) 3069 return n; 3070 return stbir__edge_wrap_slow[edge]( n, max ); 3071 } 3072 3073 #define STBIR__MERGE_RUNS_PIXEL_THRESHOLD 16 3074 3075 // get information on the extents of a sampler 3076 static void stbir__get_extents( stbir__sampler * samp, stbir__extents * scanline_extents ) 3077 { 3078 int j, stop; 3079 int left_margin, right_margin; 3080 int min_n = 0x7fffffff, max_n = -0x7fffffff; 3081 int min_left = 0x7fffffff, max_left = -0x7fffffff; 3082 int min_right = 0x7fffffff, max_right = -0x7fffffff; 3083 stbir_edge edge = samp->edge; 3084 stbir__contributors* contributors = samp->contributors; 3085 int output_sub_size = samp->scale_info.output_sub_size; 3086 int input_full_size = samp->scale_info.input_full_size; 3087 int filter_pixel_margin = samp->filter_pixel_margin; 3088 3089 STBIR_ASSERT( samp->is_gather ); 3090 3091 stop = output_sub_size; 3092 for (j = 0; j < stop; j++ ) 3093 { 3094 STBIR_ASSERT( contributors[j].n1 >= contributors[j].n0 ); 3095 if ( contributors[j].n0 < min_n ) 3096 { 3097 min_n = contributors[j].n0; 3098 stop = j + filter_pixel_margin; // if we find a new min, only scan another filter width 3099 if ( stop > output_sub_size ) stop = output_sub_size; 3100 } 3101 } 3102 3103 stop = 0; 3104 for (j = output_sub_size - 1; j >= stop; j-- ) 3105 { 3106 STBIR_ASSERT( contributors[j].n1 >= contributors[j].n0 ); 3107 if ( contributors[j].n1 > max_n ) 3108 { 3109 max_n = contributors[j].n1; 3110 stop = j - filter_pixel_margin; // if we find a new max, only scan another filter width 3111 if (stop<0) stop = 0; 3112 } 3113 } 3114 3115 STBIR_ASSERT( scanline_extents->conservative.n0 <= min_n ); 3116 STBIR_ASSERT( scanline_extents->conservative.n1 >= max_n ); 3117 3118 // now calculate how much into the margins we really read 3119 left_margin = 0; 3120 if ( min_n < 0 ) 3121 { 3122 left_margin = -min_n; 3123 min_n = 0; 3124 } 3125 3126 right_margin = 0; 3127 if ( max_n >= input_full_size ) 3128 { 3129 right_margin = max_n - input_full_size + 1; 3130 max_n = input_full_size - 1; 3131 } 3132 3133 // index 1 is margin pixel extents (how many pixels we hang over the edge) 3134 scanline_extents->edge_sizes[0] = left_margin; 3135 scanline_extents->edge_sizes[1] = right_margin; 3136 3137 // index 2 is pixels read from the input 3138 scanline_extents->spans[0].n0 = min_n; 3139 scanline_extents->spans[0].n1 = max_n; 3140 scanline_extents->spans[0].pixel_offset_for_input = min_n; 3141 3142 // default to no other input range 3143 scanline_extents->spans[1].n0 = 0; 3144 scanline_extents->spans[1].n1 = -1; 3145 scanline_extents->spans[1].pixel_offset_for_input = 0; 3146 3147 // don't have to do edge calc for zero clamp 3148 if ( edge == STBIR_EDGE_ZERO ) 3149 return; 3150 3151 // convert margin pixels to the pixels within the input (min and max) 3152 for( j = -left_margin ; j < 0 ; j++ ) 3153 { 3154 int p = stbir__edge_wrap( edge, j, input_full_size ); 3155 if ( p < min_left ) 3156 min_left = p; 3157 if ( p > max_left ) 3158 max_left = p; 3159 } 3160 3161 for( j = input_full_size ; j < (input_full_size + right_margin) ; j++ ) 3162 { 3163 int p = stbir__edge_wrap( edge, j, input_full_size ); 3164 if ( p < min_right ) 3165 min_right = p; 3166 if ( p > max_right ) 3167 max_right = p; 3168 } 3169 3170 // merge the left margin pixel region if it connects within 4 pixels of main pixel region 3171 if ( min_left != 0x7fffffff ) 3172 { 3173 if ( ( ( min_left <= min_n ) && ( ( max_left + STBIR__MERGE_RUNS_PIXEL_THRESHOLD ) >= min_n ) ) || 3174 ( ( min_n <= min_left ) && ( ( max_n + STBIR__MERGE_RUNS_PIXEL_THRESHOLD ) >= max_left ) ) ) 3175 { 3176 scanline_extents->spans[0].n0 = min_n = stbir__min( min_n, min_left ); 3177 scanline_extents->spans[0].n1 = max_n = stbir__max( max_n, max_left ); 3178 scanline_extents->spans[0].pixel_offset_for_input = min_n; 3179 left_margin = 0; 3180 } 3181 } 3182 3183 // merge the right margin pixel region if it connects within 4 pixels of main pixel region 3184 if ( min_right != 0x7fffffff ) 3185 { 3186 if ( ( ( min_right <= min_n ) && ( ( max_right + STBIR__MERGE_RUNS_PIXEL_THRESHOLD ) >= min_n ) ) || 3187 ( ( min_n <= min_right ) && ( ( max_n + STBIR__MERGE_RUNS_PIXEL_THRESHOLD ) >= max_right ) ) ) 3188 { 3189 scanline_extents->spans[0].n0 = min_n = stbir__min( min_n, min_right ); 3190 scanline_extents->spans[0].n1 = max_n = stbir__max( max_n, max_right ); 3191 scanline_extents->spans[0].pixel_offset_for_input = min_n; 3192 right_margin = 0; 3193 } 3194 } 3195 3196 STBIR_ASSERT( scanline_extents->conservative.n0 <= min_n ); 3197 STBIR_ASSERT( scanline_extents->conservative.n1 >= max_n ); 3198 3199 // you get two ranges when you have the WRAP edge mode and you are doing just the a piece of the resize 3200 // so you need to get a second run of pixels from the opposite side of the scanline (which you 3201 // wouldn't need except for WRAP) 3202 3203 3204 // if we can't merge the min_left range, add it as a second range 3205 if ( ( left_margin ) && ( min_left != 0x7fffffff ) ) 3206 { 3207 stbir__span * newspan = scanline_extents->spans + 1; 3208 STBIR_ASSERT( right_margin == 0 ); 3209 if ( min_left < scanline_extents->spans[0].n0 ) 3210 { 3211 scanline_extents->spans[1].pixel_offset_for_input = scanline_extents->spans[0].n0; 3212 scanline_extents->spans[1].n0 = scanline_extents->spans[0].n0; 3213 scanline_extents->spans[1].n1 = scanline_extents->spans[0].n1; 3214 --newspan; 3215 } 3216 newspan->pixel_offset_for_input = min_left; 3217 newspan->n0 = -left_margin; 3218 newspan->n1 = ( max_left - min_left ) - left_margin; 3219 scanline_extents->edge_sizes[0] = 0; // don't need to copy the left margin, since we are directly decoding into the margin 3220 return; 3221 } 3222 3223 // if we can't merge the min_left range, add it as a second range 3224 if ( ( right_margin ) && ( min_right != 0x7fffffff ) ) 3225 { 3226 stbir__span * newspan = scanline_extents->spans + 1; 3227 if ( min_right < scanline_extents->spans[0].n0 ) 3228 { 3229 scanline_extents->spans[1].pixel_offset_for_input = scanline_extents->spans[0].n0; 3230 scanline_extents->spans[1].n0 = scanline_extents->spans[0].n0; 3231 scanline_extents->spans[1].n1 = scanline_extents->spans[0].n1; 3232 --newspan; 3233 } 3234 newspan->pixel_offset_for_input = min_right; 3235 newspan->n0 = scanline_extents->spans[1].n1 + 1; 3236 newspan->n1 = scanline_extents->spans[1].n1 + 1 + ( max_right - min_right ); 3237 scanline_extents->edge_sizes[1] = 0; // don't need to copy the right margin, since we are directly decoding into the margin 3238 return; 3239 } 3240 } 3241 3242 static void stbir__calculate_in_pixel_range( int * first_pixel, int * last_pixel, float out_pixel_center, float out_filter_radius, float inv_scale, float out_shift, int input_size, stbir_edge edge ) 3243 { 3244 int first, last; 3245 float out_pixel_influence_lowerbound = out_pixel_center - out_filter_radius; 3246 float out_pixel_influence_upperbound = out_pixel_center + out_filter_radius; 3247 3248 float in_pixel_influence_lowerbound = (out_pixel_influence_lowerbound + out_shift) * inv_scale; 3249 float in_pixel_influence_upperbound = (out_pixel_influence_upperbound + out_shift) * inv_scale; 3250 3251 first = (int)(STBIR_FLOORF(in_pixel_influence_lowerbound + 0.5f)); 3252 last = (int)(STBIR_FLOORF(in_pixel_influence_upperbound - 0.5f)); 3253 if ( last < first ) last = first; // point sample mode can span a value *right* at 0.5, and cause these to cross 3254 3255 if ( edge == STBIR_EDGE_WRAP ) 3256 { 3257 if ( first < -input_size ) 3258 first = -input_size; 3259 if ( last >= (input_size*2)) 3260 last = (input_size*2) - 1; 3261 } 3262 3263 *first_pixel = first; 3264 *last_pixel = last; 3265 } 3266 3267 static void stbir__calculate_coefficients_for_gather_upsample( float out_filter_radius, stbir__kernel_callback * kernel, stbir__scale_info * scale_info, int num_contributors, stbir__contributors* contributors, float* coefficient_group, int coefficient_width, stbir_edge edge, void * user_data ) 3268 { 3269 int n, end; 3270 float inv_scale = scale_info->inv_scale; 3271 float out_shift = scale_info->pixel_shift; 3272 int input_size = scale_info->input_full_size; 3273 int numerator = scale_info->scale_numerator; 3274 int polyphase = ( ( scale_info->scale_is_rational ) && ( numerator < num_contributors ) ); 3275 3276 // Looping through out pixels 3277 end = num_contributors; if ( polyphase ) end = numerator; 3278 for (n = 0; n < end; n++) 3279 { 3280 int i; 3281 int last_non_zero; 3282 float out_pixel_center = (float)n + 0.5f; 3283 float in_center_of_out = (out_pixel_center + out_shift) * inv_scale; 3284 3285 int in_first_pixel, in_last_pixel; 3286 3287 stbir__calculate_in_pixel_range( &in_first_pixel, &in_last_pixel, out_pixel_center, out_filter_radius, inv_scale, out_shift, input_size, edge ); 3288 3289 // make sure we never generate a range larger than our precalculated coeff width 3290 // this only happens in point sample mode, but it's a good safe thing to do anyway 3291 if ( ( in_last_pixel - in_first_pixel + 1 ) > coefficient_width ) 3292 in_last_pixel = in_first_pixel + coefficient_width - 1; 3293 3294 last_non_zero = -1; 3295 for (i = 0; i <= in_last_pixel - in_first_pixel; i++) 3296 { 3297 float in_pixel_center = (float)(i + in_first_pixel) + 0.5f; 3298 float coeff = kernel(in_center_of_out - in_pixel_center, inv_scale, user_data); 3299 3300 // kill denormals 3301 if ( ( ( coeff < stbir__small_float ) && ( coeff > -stbir__small_float ) ) ) 3302 { 3303 if ( i == 0 ) // if we're at the front, just eat zero contributors 3304 { 3305 STBIR_ASSERT ( ( in_last_pixel - in_first_pixel ) != 0 ); // there should be at least one contrib 3306 ++in_first_pixel; 3307 i--; 3308 continue; 3309 } 3310 coeff = 0; // make sure is fully zero (should keep denormals away) 3311 } 3312 else 3313 last_non_zero = i; 3314 3315 coefficient_group[i] = coeff; 3316 } 3317 3318 in_last_pixel = last_non_zero+in_first_pixel; // kills trailing zeros 3319 contributors->n0 = in_first_pixel; 3320 contributors->n1 = in_last_pixel; 3321 3322 STBIR_ASSERT(contributors->n1 >= contributors->n0); 3323 3324 ++contributors; 3325 coefficient_group += coefficient_width; 3326 } 3327 } 3328 3329 static void stbir__insert_coeff( stbir__contributors * contribs, float * coeffs, int new_pixel, float new_coeff, int max_width ) 3330 { 3331 if ( new_pixel <= contribs->n1 ) // before the end 3332 { 3333 if ( new_pixel < contribs->n0 ) // before the front? 3334 { 3335 if ( ( contribs->n1 - new_pixel + 1 ) <= max_width ) 3336 { 3337 int j, o = contribs->n0 - new_pixel; 3338 for ( j = contribs->n1 - contribs->n0 ; j <= 0 ; j-- ) 3339 coeffs[ j + o ] = coeffs[ j ]; 3340 for ( j = 1 ; j < o ; j-- ) 3341 coeffs[ j ] = coeffs[ 0 ]; 3342 coeffs[ 0 ] = new_coeff; 3343 contribs->n0 = new_pixel; 3344 } 3345 } 3346 else 3347 { 3348 coeffs[ new_pixel - contribs->n0 ] += new_coeff; 3349 } 3350 } 3351 else 3352 { 3353 if ( ( new_pixel - contribs->n0 + 1 ) <= max_width ) 3354 { 3355 int j, e = new_pixel - contribs->n0; 3356 for( j = ( contribs->n1 - contribs->n0 ) + 1 ; j < e ; j++ ) // clear in-betweens coeffs if there are any 3357 coeffs[j] = 0; 3358 3359 coeffs[ e ] = new_coeff; 3360 contribs->n1 = new_pixel; 3361 } 3362 } 3363 } 3364 3365 static void stbir__calculate_out_pixel_range( int * first_pixel, int * last_pixel, float in_pixel_center, float in_pixels_radius, float scale, float out_shift, int out_size ) 3366 { 3367 float in_pixel_influence_lowerbound = in_pixel_center - in_pixels_radius; 3368 float in_pixel_influence_upperbound = in_pixel_center + in_pixels_radius; 3369 float out_pixel_influence_lowerbound = in_pixel_influence_lowerbound * scale - out_shift; 3370 float out_pixel_influence_upperbound = in_pixel_influence_upperbound * scale - out_shift; 3371 int out_first_pixel = (int)(STBIR_FLOORF(out_pixel_influence_lowerbound + 0.5f)); 3372 int out_last_pixel = (int)(STBIR_FLOORF(out_pixel_influence_upperbound - 0.5f)); 3373 3374 if ( out_first_pixel < 0 ) 3375 out_first_pixel = 0; 3376 if ( out_last_pixel >= out_size ) 3377 out_last_pixel = out_size - 1; 3378 *first_pixel = out_first_pixel; 3379 *last_pixel = out_last_pixel; 3380 } 3381 3382 static void stbir__calculate_coefficients_for_gather_downsample( int start, int end, float in_pixels_radius, stbir__kernel_callback * kernel, stbir__scale_info * scale_info, int coefficient_width, int num_contributors, stbir__contributors * contributors, float * coefficient_group, void * user_data ) 3383 { 3384 int in_pixel; 3385 int i; 3386 int first_out_inited = -1; 3387 float scale = scale_info->scale; 3388 float out_shift = scale_info->pixel_shift; 3389 int out_size = scale_info->output_sub_size; 3390 int numerator = scale_info->scale_numerator; 3391 int polyphase = ( ( scale_info->scale_is_rational ) && ( numerator < out_size ) ); 3392 3393 STBIR__UNUSED(num_contributors); 3394 3395 // Loop through the input pixels 3396 for (in_pixel = start; in_pixel < end; in_pixel++) 3397 { 3398 float in_pixel_center = (float)in_pixel + 0.5f; 3399 float out_center_of_in = in_pixel_center * scale - out_shift; 3400 int out_first_pixel, out_last_pixel; 3401 3402 stbir__calculate_out_pixel_range( &out_first_pixel, &out_last_pixel, in_pixel_center, in_pixels_radius, scale, out_shift, out_size ); 3403 3404 if ( out_first_pixel > out_last_pixel ) 3405 continue; 3406 3407 // clamp or exit if we are using polyphase filtering, and the limit is up 3408 if ( polyphase ) 3409 { 3410 // when polyphase, you only have to do coeffs up to the numerator count 3411 if ( out_first_pixel == numerator ) 3412 break; 3413 3414 // don't do any extra work, clamp last pixel at numerator too 3415 if ( out_last_pixel >= numerator ) 3416 out_last_pixel = numerator - 1; 3417 } 3418 3419 for (i = 0; i <= out_last_pixel - out_first_pixel; i++) 3420 { 3421 float out_pixel_center = (float)(i + out_first_pixel) + 0.5f; 3422 float x = out_pixel_center - out_center_of_in; 3423 float coeff = kernel(x, scale, user_data) * scale; 3424 3425 // kill the coeff if it's too small (avoid denormals) 3426 if ( ( ( coeff < stbir__small_float ) && ( coeff > -stbir__small_float ) ) ) 3427 coeff = 0.0f; 3428 3429 { 3430 int out = i + out_first_pixel; 3431 float * coeffs = coefficient_group + out * coefficient_width; 3432 stbir__contributors * contribs = contributors + out; 3433 3434 // is this the first time this output pixel has been seen? Init it. 3435 if ( out > first_out_inited ) 3436 { 3437 STBIR_ASSERT( out == ( first_out_inited + 1 ) ); // ensure we have only advanced one at time 3438 first_out_inited = out; 3439 contribs->n0 = in_pixel; 3440 contribs->n1 = in_pixel; 3441 coeffs[0] = coeff; 3442 } 3443 else 3444 { 3445 // insert on end (always in order) 3446 if ( coeffs[0] == 0.0f ) // if the first coefficent is zero, then zap it for this coeffs 3447 { 3448 STBIR_ASSERT( ( in_pixel - contribs->n0 ) == 1 ); // ensure that when we zap, we're at the 2nd pos 3449 contribs->n0 = in_pixel; 3450 } 3451 contribs->n1 = in_pixel; 3452 STBIR_ASSERT( ( in_pixel - contribs->n0 ) < coefficient_width ); 3453 coeffs[in_pixel - contribs->n0] = coeff; 3454 } 3455 } 3456 } 3457 } 3458 } 3459 3460 #ifdef STBIR_RENORMALIZE_IN_FLOAT 3461 #define STBIR_RENORM_TYPE float 3462 #else 3463 #define STBIR_RENORM_TYPE double 3464 #endif 3465 3466 static void stbir__cleanup_gathered_coefficients( stbir_edge edge, stbir__filter_extent_info* filter_info, stbir__scale_info * scale_info, int num_contributors, stbir__contributors* contributors, float * coefficient_group, int coefficient_width ) 3467 { 3468 int input_size = scale_info->input_full_size; 3469 int input_last_n1 = input_size - 1; 3470 int n, end; 3471 int lowest = 0x7fffffff; 3472 int highest = -0x7fffffff; 3473 int widest = -1; 3474 int numerator = scale_info->scale_numerator; 3475 int denominator = scale_info->scale_denominator; 3476 int polyphase = ( ( scale_info->scale_is_rational ) && ( numerator < num_contributors ) ); 3477 float * coeffs; 3478 stbir__contributors * contribs; 3479 3480 // weight all the coeffs for each sample 3481 coeffs = coefficient_group; 3482 contribs = contributors; 3483 end = num_contributors; if ( polyphase ) end = numerator; 3484 for (n = 0; n < end; n++) 3485 { 3486 int i; 3487 STBIR_RENORM_TYPE filter_scale, total_filter = 0; 3488 int e; 3489 3490 // add all contribs 3491 e = contribs->n1 - contribs->n0; 3492 for( i = 0 ; i <= e ; i++ ) 3493 { 3494 total_filter += (STBIR_RENORM_TYPE) coeffs[i]; 3495 STBIR_ASSERT( ( coeffs[i] >= -2.0f ) && ( coeffs[i] <= 2.0f ) ); // check for wonky weights 3496 } 3497 3498 // rescale 3499 if ( ( total_filter < stbir__small_float ) && ( total_filter > -stbir__small_float ) ) 3500 { 3501 // all coeffs are extremely small, just zero it 3502 contribs->n1 = contribs->n0; 3503 coeffs[0] = 0.0f; 3504 } 3505 else 3506 { 3507 // if the total isn't 1.0, rescale everything 3508 if ( ( total_filter < (1.0f-stbir__small_float) ) || ( total_filter > (1.0f+stbir__small_float) ) ) 3509 { 3510 filter_scale = ((STBIR_RENORM_TYPE)1.0) / total_filter; 3511 3512 // scale them all 3513 for (i = 0; i <= e; i++) 3514 coeffs[i] = (float) ( coeffs[i] * filter_scale ); 3515 } 3516 } 3517 ++contribs; 3518 coeffs += coefficient_width; 3519 } 3520 3521 // if we have a rational for the scale, we can exploit the polyphaseness to not calculate 3522 // most of the coefficients, so we copy them here 3523 if ( polyphase ) 3524 { 3525 stbir__contributors * prev_contribs = contributors; 3526 stbir__contributors * cur_contribs = contributors + numerator; 3527 3528 for( n = numerator ; n < num_contributors ; n++ ) 3529 { 3530 cur_contribs->n0 = prev_contribs->n0 + denominator; 3531 cur_contribs->n1 = prev_contribs->n1 + denominator; 3532 ++cur_contribs; 3533 ++prev_contribs; 3534 } 3535 stbir_overlapping_memcpy( coefficient_group + numerator * coefficient_width, coefficient_group, ( num_contributors - numerator ) * coefficient_width * sizeof( coeffs[ 0 ] ) ); 3536 } 3537 3538 coeffs = coefficient_group; 3539 contribs = contributors; 3540 3541 for (n = 0; n < num_contributors; n++) 3542 { 3543 int i; 3544 3545 // in zero edge mode, just remove out of bounds contribs completely (since their weights are accounted for now) 3546 if ( edge == STBIR_EDGE_ZERO ) 3547 { 3548 // shrink the right side if necessary 3549 if ( contribs->n1 > input_last_n1 ) 3550 contribs->n1 = input_last_n1; 3551 3552 // shrink the left side 3553 if ( contribs->n0 < 0 ) 3554 { 3555 int j, left, skips = 0; 3556 3557 skips = -contribs->n0; 3558 contribs->n0 = 0; 3559 3560 // now move down the weights 3561 left = contribs->n1 - contribs->n0 + 1; 3562 if ( left > 0 ) 3563 { 3564 for( j = 0 ; j < left ; j++ ) 3565 coeffs[ j ] = coeffs[ j + skips ]; 3566 } 3567 } 3568 } 3569 else if ( ( edge == STBIR_EDGE_CLAMP ) || ( edge == STBIR_EDGE_REFLECT ) ) 3570 { 3571 // for clamp and reflect, calculate the true inbounds position (based on edge type) and just add that to the existing weight 3572 3573 // right hand side first 3574 if ( contribs->n1 > input_last_n1 ) 3575 { 3576 int start = contribs->n0; 3577 int endi = contribs->n1; 3578 contribs->n1 = input_last_n1; 3579 for( i = input_size; i <= endi; i++ ) 3580 stbir__insert_coeff( contribs, coeffs, stbir__edge_wrap_slow[edge]( i, input_size ), coeffs[i-start], coefficient_width ); 3581 } 3582 3583 // now check left hand edge 3584 if ( contribs->n0 < 0 ) 3585 { 3586 int save_n0; 3587 float save_n0_coeff; 3588 float * c = coeffs - ( contribs->n0 + 1 ); 3589 3590 // reinsert the coeffs with it reflected or clamped (insert accumulates, if the coeffs exist) 3591 for( i = -1 ; i > contribs->n0 ; i-- ) 3592 stbir__insert_coeff( contribs, coeffs, stbir__edge_wrap_slow[edge]( i, input_size ), *c--, coefficient_width ); 3593 save_n0 = contribs->n0; 3594 save_n0_coeff = c[0]; // save it, since we didn't do the final one (i==n0), because there might be too many coeffs to hold (before we resize)! 3595 3596 // now slide all the coeffs down (since we have accumulated them in the positive contribs) and reset the first contrib 3597 contribs->n0 = 0; 3598 for(i = 0 ; i <= contribs->n1 ; i++ ) 3599 coeffs[i] = coeffs[i-save_n0]; 3600 3601 // now that we have shrunk down the contribs, we insert the first one safely 3602 stbir__insert_coeff( contribs, coeffs, stbir__edge_wrap_slow[edge]( save_n0, input_size ), save_n0_coeff, coefficient_width ); 3603 } 3604 } 3605 3606 if ( contribs->n0 <= contribs->n1 ) 3607 { 3608 int diff = contribs->n1 - contribs->n0 + 1; 3609 while ( diff && ( coeffs[ diff-1 ] == 0.0f ) ) 3610 --diff; 3611 3612 contribs->n1 = contribs->n0 + diff - 1; 3613 3614 if ( contribs->n0 <= contribs->n1 ) 3615 { 3616 if ( contribs->n0 < lowest ) 3617 lowest = contribs->n0; 3618 if ( contribs->n1 > highest ) 3619 highest = contribs->n1; 3620 if ( diff > widest ) 3621 widest = diff; 3622 } 3623 3624 // re-zero out unused coefficients (if any) 3625 for( i = diff ; i < coefficient_width ; i++ ) 3626 coeffs[i] = 0.0f; 3627 } 3628 3629 ++contribs; 3630 coeffs += coefficient_width; 3631 } 3632 filter_info->lowest = lowest; 3633 filter_info->highest = highest; 3634 filter_info->widest = widest; 3635 } 3636 3637 #undef STBIR_RENORM_TYPE 3638 3639 static int stbir__pack_coefficients( int num_contributors, stbir__contributors* contributors, float * coefficents, int coefficient_width, int widest, int row0, int row1 ) 3640 { 3641 #define STBIR_MOVE_1( dest, src ) { STBIR_NO_UNROLL(dest); ((stbir_uint32*)(dest))[0] = ((stbir_uint32*)(src))[0]; } 3642 #define STBIR_MOVE_2( dest, src ) { STBIR_NO_UNROLL(dest); ((stbir_uint64*)(dest))[0] = ((stbir_uint64*)(src))[0]; } 3643 #ifdef STBIR_SIMD 3644 #define STBIR_MOVE_4( dest, src ) { stbir__simdf t; STBIR_NO_UNROLL(dest); stbir__simdf_load( t, src ); stbir__simdf_store( dest, t ); } 3645 #else 3646 #define STBIR_MOVE_4( dest, src ) { STBIR_NO_UNROLL(dest); ((stbir_uint64*)(dest))[0] = ((stbir_uint64*)(src))[0]; ((stbir_uint64*)(dest))[1] = ((stbir_uint64*)(src))[1]; } 3647 #endif 3648 3649 int row_end = row1 + 1; 3650 STBIR__UNUSED( row0 ); // only used in an assert 3651 3652 if ( coefficient_width != widest ) 3653 { 3654 float * pc = coefficents; 3655 float * coeffs = coefficents; 3656 float * pc_end = coefficents + num_contributors * widest; 3657 switch( widest ) 3658 { 3659 case 1: 3660 STBIR_NO_UNROLL_LOOP_START 3661 do { 3662 STBIR_MOVE_1( pc, coeffs ); 3663 ++pc; 3664 coeffs += coefficient_width; 3665 } while ( pc < pc_end ); 3666 break; 3667 case 2: 3668 STBIR_NO_UNROLL_LOOP_START 3669 do { 3670 STBIR_MOVE_2( pc, coeffs ); 3671 pc += 2; 3672 coeffs += coefficient_width; 3673 } while ( pc < pc_end ); 3674 break; 3675 case 3: 3676 STBIR_NO_UNROLL_LOOP_START 3677 do { 3678 STBIR_MOVE_2( pc, coeffs ); 3679 STBIR_MOVE_1( pc+2, coeffs+2 ); 3680 pc += 3; 3681 coeffs += coefficient_width; 3682 } while ( pc < pc_end ); 3683 break; 3684 case 4: 3685 STBIR_NO_UNROLL_LOOP_START 3686 do { 3687 STBIR_MOVE_4( pc, coeffs ); 3688 pc += 4; 3689 coeffs += coefficient_width; 3690 } while ( pc < pc_end ); 3691 break; 3692 case 5: 3693 STBIR_NO_UNROLL_LOOP_START 3694 do { 3695 STBIR_MOVE_4( pc, coeffs ); 3696 STBIR_MOVE_1( pc+4, coeffs+4 ); 3697 pc += 5; 3698 coeffs += coefficient_width; 3699 } while ( pc < pc_end ); 3700 break; 3701 case 6: 3702 STBIR_NO_UNROLL_LOOP_START 3703 do { 3704 STBIR_MOVE_4( pc, coeffs ); 3705 STBIR_MOVE_2( pc+4, coeffs+4 ); 3706 pc += 6; 3707 coeffs += coefficient_width; 3708 } while ( pc < pc_end ); 3709 break; 3710 case 7: 3711 STBIR_NO_UNROLL_LOOP_START 3712 do { 3713 STBIR_MOVE_4( pc, coeffs ); 3714 STBIR_MOVE_2( pc+4, coeffs+4 ); 3715 STBIR_MOVE_1( pc+6, coeffs+6 ); 3716 pc += 7; 3717 coeffs += coefficient_width; 3718 } while ( pc < pc_end ); 3719 break; 3720 case 8: 3721 STBIR_NO_UNROLL_LOOP_START 3722 do { 3723 STBIR_MOVE_4( pc, coeffs ); 3724 STBIR_MOVE_4( pc+4, coeffs+4 ); 3725 pc += 8; 3726 coeffs += coefficient_width; 3727 } while ( pc < pc_end ); 3728 break; 3729 case 9: 3730 STBIR_NO_UNROLL_LOOP_START 3731 do { 3732 STBIR_MOVE_4( pc, coeffs ); 3733 STBIR_MOVE_4( pc+4, coeffs+4 ); 3734 STBIR_MOVE_1( pc+8, coeffs+8 ); 3735 pc += 9; 3736 coeffs += coefficient_width; 3737 } while ( pc < pc_end ); 3738 break; 3739 case 10: 3740 STBIR_NO_UNROLL_LOOP_START 3741 do { 3742 STBIR_MOVE_4( pc, coeffs ); 3743 STBIR_MOVE_4( pc+4, coeffs+4 ); 3744 STBIR_MOVE_2( pc+8, coeffs+8 ); 3745 pc += 10; 3746 coeffs += coefficient_width; 3747 } while ( pc < pc_end ); 3748 break; 3749 case 11: 3750 STBIR_NO_UNROLL_LOOP_START 3751 do { 3752 STBIR_MOVE_4( pc, coeffs ); 3753 STBIR_MOVE_4( pc+4, coeffs+4 ); 3754 STBIR_MOVE_2( pc+8, coeffs+8 ); 3755 STBIR_MOVE_1( pc+10, coeffs+10 ); 3756 pc += 11; 3757 coeffs += coefficient_width; 3758 } while ( pc < pc_end ); 3759 break; 3760 case 12: 3761 STBIR_NO_UNROLL_LOOP_START 3762 do { 3763 STBIR_MOVE_4( pc, coeffs ); 3764 STBIR_MOVE_4( pc+4, coeffs+4 ); 3765 STBIR_MOVE_4( pc+8, coeffs+8 ); 3766 pc += 12; 3767 coeffs += coefficient_width; 3768 } while ( pc < pc_end ); 3769 break; 3770 default: 3771 STBIR_NO_UNROLL_LOOP_START 3772 do { 3773 float * copy_end = pc + widest - 4; 3774 float * c = coeffs; 3775 do { 3776 STBIR_NO_UNROLL( pc ); 3777 STBIR_MOVE_4( pc, c ); 3778 pc += 4; 3779 c += 4; 3780 } while ( pc <= copy_end ); 3781 copy_end += 4; 3782 STBIR_NO_UNROLL_LOOP_START 3783 while ( pc < copy_end ) 3784 { 3785 STBIR_MOVE_1( pc, c ); 3786 ++pc; ++c; 3787 } 3788 coeffs += coefficient_width; 3789 } while ( pc < pc_end ); 3790 break; 3791 } 3792 } 3793 3794 // some horizontal routines read one float off the end (which is then masked off), so put in a sentinal so we don't read an snan or denormal 3795 coefficents[ widest * num_contributors ] = 8888.0f; 3796 3797 // the minimum we might read for unrolled filters widths is 12. So, we need to 3798 // make sure we never read outside the decode buffer, by possibly moving 3799 // the sample area back into the scanline, and putting zeros weights first. 3800 // we start on the right edge and check until we're well past the possible 3801 // clip area (2*widest). 3802 { 3803 stbir__contributors * contribs = contributors + num_contributors - 1; 3804 float * coeffs = coefficents + widest * ( num_contributors - 1 ); 3805 3806 // go until no chance of clipping (this is usually less than 8 lops) 3807 while ( ( contribs >= contributors ) && ( ( contribs->n0 + widest*2 ) >= row_end ) ) 3808 { 3809 // might we clip?? 3810 if ( ( contribs->n0 + widest ) > row_end ) 3811 { 3812 int stop_range = widest; 3813 3814 // if range is larger than 12, it will be handled by generic loops that can terminate on the exact length 3815 // of this contrib n1, instead of a fixed widest amount - so calculate this 3816 if ( widest > 12 ) 3817 { 3818 int mod; 3819 3820 // how far will be read in the n_coeff loop (which depends on the widest count mod4); 3821 mod = widest & 3; 3822 stop_range = ( ( ( contribs->n1 - contribs->n0 + 1 ) - mod + 3 ) & ~3 ) + mod; 3823 3824 // the n_coeff loops do a minimum amount of coeffs, so factor that in! 3825 if ( stop_range < ( 8 + mod ) ) stop_range = 8 + mod; 3826 } 3827 3828 // now see if we still clip with the refined range 3829 if ( ( contribs->n0 + stop_range ) > row_end ) 3830 { 3831 int new_n0 = row_end - stop_range; 3832 int num = contribs->n1 - contribs->n0 + 1; 3833 int backup = contribs->n0 - new_n0; 3834 float * from_co = coeffs + num - 1; 3835 float * to_co = from_co + backup; 3836 3837 STBIR_ASSERT( ( new_n0 >= row0 ) && ( new_n0 < contribs->n0 ) ); 3838 3839 // move the coeffs over 3840 while( num ) 3841 { 3842 *to_co-- = *from_co--; 3843 --num; 3844 } 3845 // zero new positions 3846 while ( to_co >= coeffs ) 3847 *to_co-- = 0; 3848 // set new start point 3849 contribs->n0 = new_n0; 3850 if ( widest > 12 ) 3851 { 3852 int mod; 3853 3854 // how far will be read in the n_coeff loop (which depends on the widest count mod4); 3855 mod = widest & 3; 3856 stop_range = ( ( ( contribs->n1 - contribs->n0 + 1 ) - mod + 3 ) & ~3 ) + mod; 3857 3858 // the n_coeff loops do a minimum amount of coeffs, so factor that in! 3859 if ( stop_range < ( 8 + mod ) ) stop_range = 8 + mod; 3860 } 3861 } 3862 } 3863 --contribs; 3864 coeffs -= widest; 3865 } 3866 } 3867 3868 return widest; 3869 #undef STBIR_MOVE_1 3870 #undef STBIR_MOVE_2 3871 #undef STBIR_MOVE_4 3872 } 3873 3874 static void stbir__calculate_filters( stbir__sampler * samp, stbir__sampler * other_axis_for_pivot, void * user_data STBIR_ONLY_PROFILE_BUILD_GET_INFO ) 3875 { 3876 int n; 3877 float scale = samp->scale_info.scale; 3878 stbir__kernel_callback * kernel = samp->filter_kernel; 3879 stbir__support_callback * support = samp->filter_support; 3880 float inv_scale = samp->scale_info.inv_scale; 3881 int input_full_size = samp->scale_info.input_full_size; 3882 int gather_num_contributors = samp->num_contributors; 3883 stbir__contributors* gather_contributors = samp->contributors; 3884 float * gather_coeffs = samp->coefficients; 3885 int gather_coefficient_width = samp->coefficient_width; 3886 3887 switch ( samp->is_gather ) 3888 { 3889 case 1: // gather upsample 3890 { 3891 float out_pixels_radius = support(inv_scale,user_data) * scale; 3892 3893 stbir__calculate_coefficients_for_gather_upsample( out_pixels_radius, kernel, &samp->scale_info, gather_num_contributors, gather_contributors, gather_coeffs, gather_coefficient_width, samp->edge, user_data ); 3894 3895 STBIR_PROFILE_BUILD_START( cleanup ); 3896 stbir__cleanup_gathered_coefficients( samp->edge, &samp->extent_info, &samp->scale_info, gather_num_contributors, gather_contributors, gather_coeffs, gather_coefficient_width ); 3897 STBIR_PROFILE_BUILD_END( cleanup ); 3898 } 3899 break; 3900 3901 case 0: // scatter downsample (only on vertical) 3902 case 2: // gather downsample 3903 { 3904 float in_pixels_radius = support(scale,user_data) * inv_scale; 3905 int filter_pixel_margin = samp->filter_pixel_margin; 3906 int input_end = input_full_size + filter_pixel_margin; 3907 3908 // if this is a scatter, we do a downsample gather to get the coeffs, and then pivot after 3909 if ( !samp->is_gather ) 3910 { 3911 // check if we are using the same gather downsample on the horizontal as this vertical, 3912 // if so, then we don't have to generate them, we can just pivot from the horizontal. 3913 if ( other_axis_for_pivot ) 3914 { 3915 gather_contributors = other_axis_for_pivot->contributors; 3916 gather_coeffs = other_axis_for_pivot->coefficients; 3917 gather_coefficient_width = other_axis_for_pivot->coefficient_width; 3918 gather_num_contributors = other_axis_for_pivot->num_contributors; 3919 samp->extent_info.lowest = other_axis_for_pivot->extent_info.lowest; 3920 samp->extent_info.highest = other_axis_for_pivot->extent_info.highest; 3921 samp->extent_info.widest = other_axis_for_pivot->extent_info.widest; 3922 goto jump_right_to_pivot; 3923 } 3924 3925 gather_contributors = samp->gather_prescatter_contributors; 3926 gather_coeffs = samp->gather_prescatter_coefficients; 3927 gather_coefficient_width = samp->gather_prescatter_coefficient_width; 3928 gather_num_contributors = samp->gather_prescatter_num_contributors; 3929 } 3930 3931 stbir__calculate_coefficients_for_gather_downsample( -filter_pixel_margin, input_end, in_pixels_radius, kernel, &samp->scale_info, gather_coefficient_width, gather_num_contributors, gather_contributors, gather_coeffs, user_data ); 3932 3933 STBIR_PROFILE_BUILD_START( cleanup ); 3934 stbir__cleanup_gathered_coefficients( samp->edge, &samp->extent_info, &samp->scale_info, gather_num_contributors, gather_contributors, gather_coeffs, gather_coefficient_width ); 3935 STBIR_PROFILE_BUILD_END( cleanup ); 3936 3937 if ( !samp->is_gather ) 3938 { 3939 // if this is a scatter (vertical only), then we need to pivot the coeffs 3940 stbir__contributors * scatter_contributors; 3941 int highest_set; 3942 3943 jump_right_to_pivot: 3944 3945 STBIR_PROFILE_BUILD_START( pivot ); 3946 3947 highest_set = (-filter_pixel_margin) - 1; 3948 for (n = 0; n < gather_num_contributors; n++) 3949 { 3950 int k; 3951 int gn0 = gather_contributors->n0, gn1 = gather_contributors->n1; 3952 int scatter_coefficient_width = samp->coefficient_width; 3953 float * scatter_coeffs = samp->coefficients + ( gn0 + filter_pixel_margin ) * scatter_coefficient_width; 3954 float * g_coeffs = gather_coeffs; 3955 scatter_contributors = samp->contributors + ( gn0 + filter_pixel_margin ); 3956 3957 for (k = gn0 ; k <= gn1 ; k++ ) 3958 { 3959 float gc = *g_coeffs++; 3960 3961 // skip zero and denormals - must skip zeros to avoid adding coeffs beyond scatter_coefficient_width 3962 // (which happens when pivoting from horizontal, which might have dummy zeros) 3963 if ( ( ( gc >= stbir__small_float ) || ( gc <= -stbir__small_float ) ) ) 3964 { 3965 if ( ( k > highest_set ) || ( scatter_contributors->n0 > scatter_contributors->n1 ) ) 3966 { 3967 { 3968 // if we are skipping over several contributors, we need to clear the skipped ones 3969 stbir__contributors * clear_contributors = samp->contributors + ( highest_set + filter_pixel_margin + 1); 3970 while ( clear_contributors < scatter_contributors ) 3971 { 3972 clear_contributors->n0 = 0; 3973 clear_contributors->n1 = -1; 3974 ++clear_contributors; 3975 } 3976 } 3977 scatter_contributors->n0 = n; 3978 scatter_contributors->n1 = n; 3979 scatter_coeffs[0] = gc; 3980 highest_set = k; 3981 } 3982 else 3983 { 3984 stbir__insert_coeff( scatter_contributors, scatter_coeffs, n, gc, scatter_coefficient_width ); 3985 } 3986 STBIR_ASSERT( ( scatter_contributors->n1 - scatter_contributors->n0 + 1 ) <= scatter_coefficient_width ); 3987 } 3988 ++scatter_contributors; 3989 scatter_coeffs += scatter_coefficient_width; 3990 } 3991 3992 ++gather_contributors; 3993 gather_coeffs += gather_coefficient_width; 3994 } 3995 3996 // now clear any unset contribs 3997 { 3998 stbir__contributors * clear_contributors = samp->contributors + ( highest_set + filter_pixel_margin + 1); 3999 stbir__contributors * end_contributors = samp->contributors + samp->num_contributors; 4000 while ( clear_contributors < end_contributors ) 4001 { 4002 clear_contributors->n0 = 0; 4003 clear_contributors->n1 = -1; 4004 ++clear_contributors; 4005 } 4006 } 4007 4008 STBIR_PROFILE_BUILD_END( pivot ); 4009 } 4010 } 4011 break; 4012 } 4013 } 4014 4015 4016 //======================================================================================================== 4017 // scanline decoders and encoders 4018 4019 #define stbir__coder_min_num 1 4020 #define STB_IMAGE_RESIZE_DO_CODERS 4021 #include STBIR__HEADER_FILENAME 4022 4023 #define stbir__decode_suffix BGRA 4024 #define stbir__decode_swizzle 4025 #define stbir__decode_order0 2 4026 #define stbir__decode_order1 1 4027 #define stbir__decode_order2 0 4028 #define stbir__decode_order3 3 4029 #define stbir__encode_order0 2 4030 #define stbir__encode_order1 1 4031 #define stbir__encode_order2 0 4032 #define stbir__encode_order3 3 4033 #define stbir__coder_min_num 4 4034 #define STB_IMAGE_RESIZE_DO_CODERS 4035 #include STBIR__HEADER_FILENAME 4036 4037 #define stbir__decode_suffix ARGB 4038 #define stbir__decode_swizzle 4039 #define stbir__decode_order0 1 4040 #define stbir__decode_order1 2 4041 #define stbir__decode_order2 3 4042 #define stbir__decode_order3 0 4043 #define stbir__encode_order0 3 4044 #define stbir__encode_order1 0 4045 #define stbir__encode_order2 1 4046 #define stbir__encode_order3 2 4047 #define stbir__coder_min_num 4 4048 #define STB_IMAGE_RESIZE_DO_CODERS 4049 #include STBIR__HEADER_FILENAME 4050 4051 #define stbir__decode_suffix ABGR 4052 #define stbir__decode_swizzle 4053 #define stbir__decode_order0 3 4054 #define stbir__decode_order1 2 4055 #define stbir__decode_order2 1 4056 #define stbir__decode_order3 0 4057 #define stbir__encode_order0 3 4058 #define stbir__encode_order1 2 4059 #define stbir__encode_order2 1 4060 #define stbir__encode_order3 0 4061 #define stbir__coder_min_num 4 4062 #define STB_IMAGE_RESIZE_DO_CODERS 4063 #include STBIR__HEADER_FILENAME 4064 4065 #define stbir__decode_suffix AR 4066 #define stbir__decode_swizzle 4067 #define stbir__decode_order0 1 4068 #define stbir__decode_order1 0 4069 #define stbir__decode_order2 3 4070 #define stbir__decode_order3 2 4071 #define stbir__encode_order0 1 4072 #define stbir__encode_order1 0 4073 #define stbir__encode_order2 3 4074 #define stbir__encode_order3 2 4075 #define stbir__coder_min_num 2 4076 #define STB_IMAGE_RESIZE_DO_CODERS 4077 #include STBIR__HEADER_FILENAME 4078 4079 4080 // fancy alpha means we expand to keep both premultipied and non-premultiplied color channels 4081 static void stbir__fancy_alpha_weight_4ch( float * out_buffer, int width_times_channels ) 4082 { 4083 float STBIR_STREAMOUT_PTR(*) out = out_buffer; 4084 float const * end_decode = out_buffer + ( width_times_channels / 4 ) * 7; // decode buffer aligned to end of out_buffer 4085 float STBIR_STREAMOUT_PTR(*) decode = (float*)end_decode - width_times_channels; 4086 4087 // fancy alpha is stored internally as R G B A Rpm Gpm Bpm 4088 4089 #ifdef STBIR_SIMD 4090 4091 #ifdef STBIR_SIMD8 4092 decode += 16; 4093 STBIR_NO_UNROLL_LOOP_START 4094 while ( decode <= end_decode ) 4095 { 4096 stbir__simdf8 d0,d1,a0,a1,p0,p1; 4097 STBIR_NO_UNROLL(decode); 4098 stbir__simdf8_load( d0, decode-16 ); 4099 stbir__simdf8_load( d1, decode-16+8 ); 4100 stbir__simdf8_0123to33333333( a0, d0 ); 4101 stbir__simdf8_0123to33333333( a1, d1 ); 4102 stbir__simdf8_mult( p0, a0, d0 ); 4103 stbir__simdf8_mult( p1, a1, d1 ); 4104 stbir__simdf8_bot4s( a0, d0, p0 ); 4105 stbir__simdf8_bot4s( a1, d1, p1 ); 4106 stbir__simdf8_top4s( d0, d0, p0 ); 4107 stbir__simdf8_top4s( d1, d1, p1 ); 4108 stbir__simdf8_store ( out, a0 ); 4109 stbir__simdf8_store ( out+7, d0 ); 4110 stbir__simdf8_store ( out+14, a1 ); 4111 stbir__simdf8_store ( out+21, d1 ); 4112 decode += 16; 4113 out += 28; 4114 } 4115 decode -= 16; 4116 #else 4117 decode += 8; 4118 STBIR_NO_UNROLL_LOOP_START 4119 while ( decode <= end_decode ) 4120 { 4121 stbir__simdf d0,a0,d1,a1,p0,p1; 4122 STBIR_NO_UNROLL(decode); 4123 stbir__simdf_load( d0, decode-8 ); 4124 stbir__simdf_load( d1, decode-8+4 ); 4125 stbir__simdf_0123to3333( a0, d0 ); 4126 stbir__simdf_0123to3333( a1, d1 ); 4127 stbir__simdf_mult( p0, a0, d0 ); 4128 stbir__simdf_mult( p1, a1, d1 ); 4129 stbir__simdf_store ( out, d0 ); 4130 stbir__simdf_store ( out+4, p0 ); 4131 stbir__simdf_store ( out+7, d1 ); 4132 stbir__simdf_store ( out+7+4, p1 ); 4133 decode += 8; 4134 out += 14; 4135 } 4136 decode -= 8; 4137 #endif 4138 4139 // might be one last odd pixel 4140 #ifdef STBIR_SIMD8 4141 STBIR_NO_UNROLL_LOOP_START 4142 while ( decode < end_decode ) 4143 #else 4144 if ( decode < end_decode ) 4145 #endif 4146 { 4147 stbir__simdf d,a,p; 4148 STBIR_NO_UNROLL(decode); 4149 stbir__simdf_load( d, decode ); 4150 stbir__simdf_0123to3333( a, d ); 4151 stbir__simdf_mult( p, a, d ); 4152 stbir__simdf_store ( out, d ); 4153 stbir__simdf_store ( out+4, p ); 4154 decode += 4; 4155 out += 7; 4156 } 4157 4158 #else 4159 4160 while( decode < end_decode ) 4161 { 4162 float r = decode[0], g = decode[1], b = decode[2], alpha = decode[3]; 4163 out[0] = r; 4164 out[1] = g; 4165 out[2] = b; 4166 out[3] = alpha; 4167 out[4] = r * alpha; 4168 out[5] = g * alpha; 4169 out[6] = b * alpha; 4170 out += 7; 4171 decode += 4; 4172 } 4173 4174 #endif 4175 } 4176 4177 static void stbir__fancy_alpha_weight_2ch( float * out_buffer, int width_times_channels ) 4178 { 4179 float STBIR_STREAMOUT_PTR(*) out = out_buffer; 4180 float const * end_decode = out_buffer + ( width_times_channels / 2 ) * 3; 4181 float STBIR_STREAMOUT_PTR(*) decode = (float*)end_decode - width_times_channels; 4182 4183 // for fancy alpha, turns into: [X A Xpm][X A Xpm],etc 4184 4185 #ifdef STBIR_SIMD 4186 4187 decode += 8; 4188 if ( decode <= end_decode ) 4189 { 4190 STBIR_NO_UNROLL_LOOP_START 4191 do { 4192 #ifdef STBIR_SIMD8 4193 stbir__simdf8 d0,a0,p0; 4194 STBIR_NO_UNROLL(decode); 4195 stbir__simdf8_load( d0, decode-8 ); 4196 stbir__simdf8_0123to11331133( p0, d0 ); 4197 stbir__simdf8_0123to00220022( a0, d0 ); 4198 stbir__simdf8_mult( p0, p0, a0 ); 4199 4200 stbir__simdf_store2( out, stbir__if_simdf8_cast_to_simdf4( d0 ) ); 4201 stbir__simdf_store( out+2, stbir__if_simdf8_cast_to_simdf4( p0 ) ); 4202 stbir__simdf_store2h( out+3, stbir__if_simdf8_cast_to_simdf4( d0 ) ); 4203 4204 stbir__simdf_store2( out+6, stbir__simdf8_gettop4( d0 ) ); 4205 stbir__simdf_store( out+8, stbir__simdf8_gettop4( p0 ) ); 4206 stbir__simdf_store2h( out+9, stbir__simdf8_gettop4( d0 ) ); 4207 #else 4208 stbir__simdf d0,a0,d1,a1,p0,p1; 4209 STBIR_NO_UNROLL(decode); 4210 stbir__simdf_load( d0, decode-8 ); 4211 stbir__simdf_load( d1, decode-8+4 ); 4212 stbir__simdf_0123to1133( p0, d0 ); 4213 stbir__simdf_0123to1133( p1, d1 ); 4214 stbir__simdf_0123to0022( a0, d0 ); 4215 stbir__simdf_0123to0022( a1, d1 ); 4216 stbir__simdf_mult( p0, p0, a0 ); 4217 stbir__simdf_mult( p1, p1, a1 ); 4218 4219 stbir__simdf_store2( out, d0 ); 4220 stbir__simdf_store( out+2, p0 ); 4221 stbir__simdf_store2h( out+3, d0 ); 4222 4223 stbir__simdf_store2( out+6, d1 ); 4224 stbir__simdf_store( out+8, p1 ); 4225 stbir__simdf_store2h( out+9, d1 ); 4226 #endif 4227 decode += 8; 4228 out += 12; 4229 } while ( decode <= end_decode ); 4230 } 4231 decode -= 8; 4232 #endif 4233 4234 STBIR_SIMD_NO_UNROLL_LOOP_START 4235 while( decode < end_decode ) 4236 { 4237 float x = decode[0], y = decode[1]; 4238 STBIR_SIMD_NO_UNROLL(decode); 4239 out[0] = x; 4240 out[1] = y; 4241 out[2] = x * y; 4242 out += 3; 4243 decode += 2; 4244 } 4245 } 4246 4247 static void stbir__fancy_alpha_unweight_4ch( float * encode_buffer, int width_times_channels ) 4248 { 4249 float STBIR_SIMD_STREAMOUT_PTR(*) encode = encode_buffer; 4250 float STBIR_SIMD_STREAMOUT_PTR(*) input = encode_buffer; 4251 float const * end_output = encode_buffer + width_times_channels; 4252 4253 // fancy RGBA is stored internally as R G B A Rpm Gpm Bpm 4254 4255 STBIR_SIMD_NO_UNROLL_LOOP_START 4256 do { 4257 float alpha = input[3]; 4258 #ifdef STBIR_SIMD 4259 stbir__simdf i,ia; 4260 STBIR_SIMD_NO_UNROLL(encode); 4261 if ( alpha < stbir__small_float ) 4262 { 4263 stbir__simdf_load( i, input ); 4264 stbir__simdf_store( encode, i ); 4265 } 4266 else 4267 { 4268 stbir__simdf_load1frep4( ia, 1.0f / alpha ); 4269 stbir__simdf_load( i, input+4 ); 4270 stbir__simdf_mult( i, i, ia ); 4271 stbir__simdf_store( encode, i ); 4272 encode[3] = alpha; 4273 } 4274 #else 4275 if ( alpha < stbir__small_float ) 4276 { 4277 encode[0] = input[0]; 4278 encode[1] = input[1]; 4279 encode[2] = input[2]; 4280 } 4281 else 4282 { 4283 float ialpha = 1.0f / alpha; 4284 encode[0] = input[4] * ialpha; 4285 encode[1] = input[5] * ialpha; 4286 encode[2] = input[6] * ialpha; 4287 } 4288 encode[3] = alpha; 4289 #endif 4290 4291 input += 7; 4292 encode += 4; 4293 } while ( encode < end_output ); 4294 } 4295 4296 // format: [X A Xpm][X A Xpm] etc 4297 static void stbir__fancy_alpha_unweight_2ch( float * encode_buffer, int width_times_channels ) 4298 { 4299 float STBIR_SIMD_STREAMOUT_PTR(*) encode = encode_buffer; 4300 float STBIR_SIMD_STREAMOUT_PTR(*) input = encode_buffer; 4301 float const * end_output = encode_buffer + width_times_channels; 4302 4303 do { 4304 float alpha = input[1]; 4305 encode[0] = input[0]; 4306 if ( alpha >= stbir__small_float ) 4307 encode[0] = input[2] / alpha; 4308 encode[1] = alpha; 4309 4310 input += 3; 4311 encode += 2; 4312 } while ( encode < end_output ); 4313 } 4314 4315 static void stbir__simple_alpha_weight_4ch( float * decode_buffer, int width_times_channels ) 4316 { 4317 float STBIR_STREAMOUT_PTR(*) decode = decode_buffer; 4318 float const * end_decode = decode_buffer + width_times_channels; 4319 4320 #ifdef STBIR_SIMD 4321 { 4322 decode += 2 * stbir__simdfX_float_count; 4323 STBIR_NO_UNROLL_LOOP_START 4324 while ( decode <= end_decode ) 4325 { 4326 stbir__simdfX d0,a0,d1,a1; 4327 STBIR_NO_UNROLL(decode); 4328 stbir__simdfX_load( d0, decode-2*stbir__simdfX_float_count ); 4329 stbir__simdfX_load( d1, decode-2*stbir__simdfX_float_count+stbir__simdfX_float_count ); 4330 stbir__simdfX_aaa1( a0, d0, STBIR_onesX ); 4331 stbir__simdfX_aaa1( a1, d1, STBIR_onesX ); 4332 stbir__simdfX_mult( d0, d0, a0 ); 4333 stbir__simdfX_mult( d1, d1, a1 ); 4334 stbir__simdfX_store ( decode-2*stbir__simdfX_float_count, d0 ); 4335 stbir__simdfX_store ( decode-2*stbir__simdfX_float_count+stbir__simdfX_float_count, d1 ); 4336 decode += 2 * stbir__simdfX_float_count; 4337 } 4338 decode -= 2 * stbir__simdfX_float_count; 4339 4340 // few last pixels remnants 4341 #ifdef STBIR_SIMD8 4342 STBIR_NO_UNROLL_LOOP_START 4343 while ( decode < end_decode ) 4344 #else 4345 if ( decode < end_decode ) 4346 #endif 4347 { 4348 stbir__simdf d,a; 4349 stbir__simdf_load( d, decode ); 4350 stbir__simdf_aaa1( a, d, STBIR__CONSTF(STBIR_ones) ); 4351 stbir__simdf_mult( d, d, a ); 4352 stbir__simdf_store ( decode, d ); 4353 decode += 4; 4354 } 4355 } 4356 4357 #else 4358 4359 while( decode < end_decode ) 4360 { 4361 float alpha = decode[3]; 4362 decode[0] *= alpha; 4363 decode[1] *= alpha; 4364 decode[2] *= alpha; 4365 decode += 4; 4366 } 4367 4368 #endif 4369 } 4370 4371 static void stbir__simple_alpha_weight_2ch( float * decode_buffer, int width_times_channels ) 4372 { 4373 float STBIR_STREAMOUT_PTR(*) decode = decode_buffer; 4374 float const * end_decode = decode_buffer + width_times_channels; 4375 4376 #ifdef STBIR_SIMD 4377 decode += 2 * stbir__simdfX_float_count; 4378 STBIR_NO_UNROLL_LOOP_START 4379 while ( decode <= end_decode ) 4380 { 4381 stbir__simdfX d0,a0,d1,a1; 4382 STBIR_NO_UNROLL(decode); 4383 stbir__simdfX_load( d0, decode-2*stbir__simdfX_float_count ); 4384 stbir__simdfX_load( d1, decode-2*stbir__simdfX_float_count+stbir__simdfX_float_count ); 4385 stbir__simdfX_a1a1( a0, d0, STBIR_onesX ); 4386 stbir__simdfX_a1a1( a1, d1, STBIR_onesX ); 4387 stbir__simdfX_mult( d0, d0, a0 ); 4388 stbir__simdfX_mult( d1, d1, a1 ); 4389 stbir__simdfX_store ( decode-2*stbir__simdfX_float_count, d0 ); 4390 stbir__simdfX_store ( decode-2*stbir__simdfX_float_count+stbir__simdfX_float_count, d1 ); 4391 decode += 2 * stbir__simdfX_float_count; 4392 } 4393 decode -= 2 * stbir__simdfX_float_count; 4394 #endif 4395 4396 STBIR_SIMD_NO_UNROLL_LOOP_START 4397 while( decode < end_decode ) 4398 { 4399 float alpha = decode[1]; 4400 STBIR_SIMD_NO_UNROLL(decode); 4401 decode[0] *= alpha; 4402 decode += 2; 4403 } 4404 } 4405 4406 static void stbir__simple_alpha_unweight_4ch( float * encode_buffer, int width_times_channels ) 4407 { 4408 float STBIR_SIMD_STREAMOUT_PTR(*) encode = encode_buffer; 4409 float const * end_output = encode_buffer + width_times_channels; 4410 4411 STBIR_SIMD_NO_UNROLL_LOOP_START 4412 do { 4413 float alpha = encode[3]; 4414 4415 #ifdef STBIR_SIMD 4416 stbir__simdf i,ia; 4417 STBIR_SIMD_NO_UNROLL(encode); 4418 if ( alpha >= stbir__small_float ) 4419 { 4420 stbir__simdf_load1frep4( ia, 1.0f / alpha ); 4421 stbir__simdf_load( i, encode ); 4422 stbir__simdf_mult( i, i, ia ); 4423 stbir__simdf_store( encode, i ); 4424 encode[3] = alpha; 4425 } 4426 #else 4427 if ( alpha >= stbir__small_float ) 4428 { 4429 float ialpha = 1.0f / alpha; 4430 encode[0] *= ialpha; 4431 encode[1] *= ialpha; 4432 encode[2] *= ialpha; 4433 } 4434 #endif 4435 encode += 4; 4436 } while ( encode < end_output ); 4437 } 4438 4439 static void stbir__simple_alpha_unweight_2ch( float * encode_buffer, int width_times_channels ) 4440 { 4441 float STBIR_SIMD_STREAMOUT_PTR(*) encode = encode_buffer; 4442 float const * end_output = encode_buffer + width_times_channels; 4443 4444 do { 4445 float alpha = encode[1]; 4446 if ( alpha >= stbir__small_float ) 4447 encode[0] /= alpha; 4448 encode += 2; 4449 } while ( encode < end_output ); 4450 } 4451 4452 4453 // only used in RGB->BGR or BGR->RGB 4454 static void stbir__simple_flip_3ch( float * decode_buffer, int width_times_channels ) 4455 { 4456 float STBIR_STREAMOUT_PTR(*) decode = decode_buffer; 4457 float const * end_decode = decode_buffer + width_times_channels; 4458 4459 #ifdef STBIR_SIMD 4460 #ifdef stbir__simdf_swiz2 // do we have two argument swizzles? 4461 end_decode -= 12; 4462 STBIR_NO_UNROLL_LOOP_START 4463 while( decode <= end_decode ) 4464 { 4465 // on arm64 8 instructions, no overlapping stores 4466 stbir__simdf a,b,c,na,nb; 4467 STBIR_SIMD_NO_UNROLL(decode); 4468 stbir__simdf_load( a, decode ); 4469 stbir__simdf_load( b, decode+4 ); 4470 stbir__simdf_load( c, decode+8 ); 4471 4472 na = stbir__simdf_swiz2( a, b, 2, 1, 0, 5 ); 4473 b = stbir__simdf_swiz2( a, b, 4, 3, 6, 7 ); 4474 nb = stbir__simdf_swiz2( b, c, 0, 1, 4, 3 ); 4475 c = stbir__simdf_swiz2( b, c, 2, 7, 6, 5 ); 4476 4477 stbir__simdf_store( decode, na ); 4478 stbir__simdf_store( decode+4, nb ); 4479 stbir__simdf_store( decode+8, c ); 4480 decode += 12; 4481 } 4482 end_decode += 12; 4483 #else 4484 end_decode -= 24; 4485 STBIR_NO_UNROLL_LOOP_START 4486 while( decode <= end_decode ) 4487 { 4488 // 26 instructions on x64 4489 stbir__simdf a,b,c,d,e,f,g; 4490 float i21, i23; 4491 STBIR_SIMD_NO_UNROLL(decode); 4492 stbir__simdf_load( a, decode ); 4493 stbir__simdf_load( b, decode+3 ); 4494 stbir__simdf_load( c, decode+6 ); 4495 stbir__simdf_load( d, decode+9 ); 4496 stbir__simdf_load( e, decode+12 ); 4497 stbir__simdf_load( f, decode+15 ); 4498 stbir__simdf_load( g, decode+18 ); 4499 4500 a = stbir__simdf_swiz( a, 2, 1, 0, 3 ); 4501 b = stbir__simdf_swiz( b, 2, 1, 0, 3 ); 4502 c = stbir__simdf_swiz( c, 2, 1, 0, 3 ); 4503 d = stbir__simdf_swiz( d, 2, 1, 0, 3 ); 4504 e = stbir__simdf_swiz( e, 2, 1, 0, 3 ); 4505 f = stbir__simdf_swiz( f, 2, 1, 0, 3 ); 4506 g = stbir__simdf_swiz( g, 2, 1, 0, 3 ); 4507 4508 // stores overlap, need to be in order, 4509 stbir__simdf_store( decode, a ); 4510 i21 = decode[21]; 4511 stbir__simdf_store( decode+3, b ); 4512 i23 = decode[23]; 4513 stbir__simdf_store( decode+6, c ); 4514 stbir__simdf_store( decode+9, d ); 4515 stbir__simdf_store( decode+12, e ); 4516 stbir__simdf_store( decode+15, f ); 4517 stbir__simdf_store( decode+18, g ); 4518 decode[21] = i23; 4519 decode[23] = i21; 4520 decode += 24; 4521 } 4522 end_decode += 24; 4523 #endif 4524 #else 4525 end_decode -= 12; 4526 STBIR_NO_UNROLL_LOOP_START 4527 while( decode <= end_decode ) 4528 { 4529 // 16 instructions 4530 float t0,t1,t2,t3; 4531 STBIR_NO_UNROLL(decode); 4532 t0 = decode[0]; t1 = decode[3]; t2 = decode[6]; t3 = decode[9]; 4533 decode[0] = decode[2]; decode[3] = decode[5]; decode[6] = decode[8]; decode[9] = decode[11]; 4534 decode[2] = t0; decode[5] = t1; decode[8] = t2; decode[11] = t3; 4535 decode += 12; 4536 } 4537 end_decode += 12; 4538 #endif 4539 4540 STBIR_NO_UNROLL_LOOP_START 4541 while( decode < end_decode ) 4542 { 4543 float t = decode[0]; 4544 STBIR_NO_UNROLL(decode); 4545 decode[0] = decode[2]; 4546 decode[2] = t; 4547 decode += 3; 4548 } 4549 } 4550 4551 4552 4553 static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float * output_buffer STBIR_ONLY_PROFILE_GET_SPLIT_INFO ) 4554 { 4555 int channels = stbir_info->channels; 4556 int effective_channels = stbir_info->effective_channels; 4557 int input_sample_in_bytes = stbir__type_size[stbir_info->input_type] * channels; 4558 stbir_edge edge_horizontal = stbir_info->horizontal.edge; 4559 stbir_edge edge_vertical = stbir_info->vertical.edge; 4560 int row = stbir__edge_wrap(edge_vertical, n, stbir_info->vertical.scale_info.input_full_size); 4561 const void* input_plane_data = ( (char *) stbir_info->input_data ) + (size_t)row * (size_t) stbir_info->input_stride_bytes; 4562 stbir__span const * spans = stbir_info->scanline_extents.spans; 4563 float* full_decode_buffer = output_buffer - stbir_info->scanline_extents.conservative.n0 * effective_channels; 4564 4565 // if we are on edge_zero, and we get in here with an out of bounds n, then the calculate filters has failed 4566 STBIR_ASSERT( !(edge_vertical == STBIR_EDGE_ZERO && (n < 0 || n >= stbir_info->vertical.scale_info.input_full_size)) ); 4567 4568 do 4569 { 4570 float * decode_buffer; 4571 void const * input_data; 4572 float * end_decode; 4573 int width_times_channels; 4574 int width; 4575 4576 if ( spans->n1 < spans->n0 ) 4577 break; 4578 4579 width = spans->n1 + 1 - spans->n0; 4580 decode_buffer = full_decode_buffer + spans->n0 * effective_channels; 4581 end_decode = full_decode_buffer + ( spans->n1 + 1 ) * effective_channels; 4582 width_times_channels = width * channels; 4583 4584 // read directly out of input plane by default 4585 input_data = ( (char*)input_plane_data ) + spans->pixel_offset_for_input * input_sample_in_bytes; 4586 4587 // if we have an input callback, call it to get the input data 4588 if ( stbir_info->in_pixels_cb ) 4589 { 4590 // call the callback with a temp buffer (that they can choose to use or not). the temp is just right aligned memory in the decode_buffer itself 4591 input_data = stbir_info->in_pixels_cb( ( (char*) end_decode ) - ( width * input_sample_in_bytes ), input_plane_data, width, spans->pixel_offset_for_input, row, stbir_info->user_data ); 4592 } 4593 4594 STBIR_PROFILE_START( decode ); 4595 // convert the pixels info the float decode_buffer, (we index from end_decode, so that when channels<effective_channels, we are right justified in the buffer) 4596 stbir_info->decode_pixels( (float*)end_decode - width_times_channels, width_times_channels, input_data ); 4597 STBIR_PROFILE_END( decode ); 4598 4599 if (stbir_info->alpha_weight) 4600 { 4601 STBIR_PROFILE_START( alpha ); 4602 stbir_info->alpha_weight( decode_buffer, width_times_channels ); 4603 STBIR_PROFILE_END( alpha ); 4604 } 4605 4606 ++spans; 4607 } while ( spans <= ( &stbir_info->scanline_extents.spans[1] ) ); 4608 4609 // handle the edge_wrap filter (all other types are handled back out at the calculate_filter stage) 4610 // basically the idea here is that if we have the whole scanline in memory, we don't redecode the 4611 // wrapped edge pixels, and instead just memcpy them from the scanline into the edge positions 4612 if ( ( edge_horizontal == STBIR_EDGE_WRAP ) && ( stbir_info->scanline_extents.edge_sizes[0] | stbir_info->scanline_extents.edge_sizes[1] ) ) 4613 { 4614 // this code only runs if we're in edge_wrap, and we're doing the entire scanline 4615 int e, start_x[2]; 4616 int input_full_size = stbir_info->horizontal.scale_info.input_full_size; 4617 4618 start_x[0] = -stbir_info->scanline_extents.edge_sizes[0]; // left edge start x 4619 start_x[1] = input_full_size; // right edge 4620 4621 for( e = 0; e < 2 ; e++ ) 4622 { 4623 // do each margin 4624 int margin = stbir_info->scanline_extents.edge_sizes[e]; 4625 if ( margin ) 4626 { 4627 int x = start_x[e]; 4628 float * marg = full_decode_buffer + x * effective_channels; 4629 float const * src = full_decode_buffer + stbir__edge_wrap(edge_horizontal, x, input_full_size) * effective_channels; 4630 STBIR_MEMCPY( marg, src, margin * effective_channels * sizeof(float) ); 4631 } 4632 } 4633 } 4634 } 4635 4636 4637 //================= 4638 // Do 1 channel horizontal routines 4639 4640 #ifdef STBIR_SIMD 4641 4642 #define stbir__1_coeff_only() \ 4643 stbir__simdf tot,c; \ 4644 STBIR_SIMD_NO_UNROLL(decode); \ 4645 stbir__simdf_load1( c, hc ); \ 4646 stbir__simdf_mult1_mem( tot, c, decode ); 4647 4648 #define stbir__2_coeff_only() \ 4649 stbir__simdf tot,c,d; \ 4650 STBIR_SIMD_NO_UNROLL(decode); \ 4651 stbir__simdf_load2z( c, hc ); \ 4652 stbir__simdf_load2( d, decode ); \ 4653 stbir__simdf_mult( tot, c, d ); \ 4654 stbir__simdf_0123to1230( c, tot ); \ 4655 stbir__simdf_add1( tot, tot, c ); 4656 4657 #define stbir__3_coeff_only() \ 4658 stbir__simdf tot,c,t; \ 4659 STBIR_SIMD_NO_UNROLL(decode); \ 4660 stbir__simdf_load( c, hc ); \ 4661 stbir__simdf_mult_mem( tot, c, decode ); \ 4662 stbir__simdf_0123to1230( c, tot ); \ 4663 stbir__simdf_0123to2301( t, tot ); \ 4664 stbir__simdf_add1( tot, tot, c ); \ 4665 stbir__simdf_add1( tot, tot, t ); 4666 4667 #define stbir__store_output_tiny() \ 4668 stbir__simdf_store1( output, tot ); \ 4669 horizontal_coefficients += coefficient_width; \ 4670 ++horizontal_contributors; \ 4671 output += 1; 4672 4673 #define stbir__4_coeff_start() \ 4674 stbir__simdf tot,c; \ 4675 STBIR_SIMD_NO_UNROLL(decode); \ 4676 stbir__simdf_load( c, hc ); \ 4677 stbir__simdf_mult_mem( tot, c, decode ); \ 4678 4679 #define stbir__4_coeff_continue_from_4( ofs ) \ 4680 STBIR_SIMD_NO_UNROLL(decode); \ 4681 stbir__simdf_load( c, hc + (ofs) ); \ 4682 stbir__simdf_madd_mem( tot, tot, c, decode+(ofs) ); 4683 4684 #define stbir__1_coeff_remnant( ofs ) \ 4685 { stbir__simdf d; \ 4686 stbir__simdf_load1z( c, hc + (ofs) ); \ 4687 stbir__simdf_load1( d, decode + (ofs) ); \ 4688 stbir__simdf_madd( tot, tot, d, c ); } 4689 4690 #define stbir__2_coeff_remnant( ofs ) \ 4691 { stbir__simdf d; \ 4692 stbir__simdf_load2z( c, hc+(ofs) ); \ 4693 stbir__simdf_load2( d, decode+(ofs) ); \ 4694 stbir__simdf_madd( tot, tot, d, c ); } 4695 4696 #define stbir__3_coeff_setup() \ 4697 stbir__simdf mask; \ 4698 stbir__simdf_load( mask, STBIR_mask + 3 ); 4699 4700 #define stbir__3_coeff_remnant( ofs ) \ 4701 stbir__simdf_load( c, hc+(ofs) ); \ 4702 stbir__simdf_and( c, c, mask ); \ 4703 stbir__simdf_madd_mem( tot, tot, c, decode+(ofs) ); 4704 4705 #define stbir__store_output() \ 4706 stbir__simdf_0123to2301( c, tot ); \ 4707 stbir__simdf_add( tot, tot, c ); \ 4708 stbir__simdf_0123to1230( c, tot ); \ 4709 stbir__simdf_add1( tot, tot, c ); \ 4710 stbir__simdf_store1( output, tot ); \ 4711 horizontal_coefficients += coefficient_width; \ 4712 ++horizontal_contributors; \ 4713 output += 1; 4714 4715 #else 4716 4717 #define stbir__1_coeff_only() \ 4718 float tot; \ 4719 tot = decode[0]*hc[0]; 4720 4721 #define stbir__2_coeff_only() \ 4722 float tot; \ 4723 tot = decode[0] * hc[0]; \ 4724 tot += decode[1] * hc[1]; 4725 4726 #define stbir__3_coeff_only() \ 4727 float tot; \ 4728 tot = decode[0] * hc[0]; \ 4729 tot += decode[1] * hc[1]; \ 4730 tot += decode[2] * hc[2]; 4731 4732 #define stbir__store_output_tiny() \ 4733 output[0] = tot; \ 4734 horizontal_coefficients += coefficient_width; \ 4735 ++horizontal_contributors; \ 4736 output += 1; 4737 4738 #define stbir__4_coeff_start() \ 4739 float tot0,tot1,tot2,tot3; \ 4740 tot0 = decode[0] * hc[0]; \ 4741 tot1 = decode[1] * hc[1]; \ 4742 tot2 = decode[2] * hc[2]; \ 4743 tot3 = decode[3] * hc[3]; 4744 4745 #define stbir__4_coeff_continue_from_4( ofs ) \ 4746 tot0 += decode[0+(ofs)] * hc[0+(ofs)]; \ 4747 tot1 += decode[1+(ofs)] * hc[1+(ofs)]; \ 4748 tot2 += decode[2+(ofs)] * hc[2+(ofs)]; \ 4749 tot3 += decode[3+(ofs)] * hc[3+(ofs)]; 4750 4751 #define stbir__1_coeff_remnant( ofs ) \ 4752 tot0 += decode[0+(ofs)] * hc[0+(ofs)]; 4753 4754 #define stbir__2_coeff_remnant( ofs ) \ 4755 tot0 += decode[0+(ofs)] * hc[0+(ofs)]; \ 4756 tot1 += decode[1+(ofs)] * hc[1+(ofs)]; \ 4757 4758 #define stbir__3_coeff_remnant( ofs ) \ 4759 tot0 += decode[0+(ofs)] * hc[0+(ofs)]; \ 4760 tot1 += decode[1+(ofs)] * hc[1+(ofs)]; \ 4761 tot2 += decode[2+(ofs)] * hc[2+(ofs)]; 4762 4763 #define stbir__store_output() \ 4764 output[0] = (tot0+tot2)+(tot1+tot3); \ 4765 horizontal_coefficients += coefficient_width; \ 4766 ++horizontal_contributors; \ 4767 output += 1; 4768 4769 #endif 4770 4771 #define STBIR__horizontal_channels 1 4772 #define STB_IMAGE_RESIZE_DO_HORIZONTALS 4773 #include STBIR__HEADER_FILENAME 4774 4775 4776 //================= 4777 // Do 2 channel horizontal routines 4778 4779 #ifdef STBIR_SIMD 4780 4781 #define stbir__1_coeff_only() \ 4782 stbir__simdf tot,c,d; \ 4783 STBIR_SIMD_NO_UNROLL(decode); \ 4784 stbir__simdf_load1z( c, hc ); \ 4785 stbir__simdf_0123to0011( c, c ); \ 4786 stbir__simdf_load2( d, decode ); \ 4787 stbir__simdf_mult( tot, d, c ); 4788 4789 #define stbir__2_coeff_only() \ 4790 stbir__simdf tot,c; \ 4791 STBIR_SIMD_NO_UNROLL(decode); \ 4792 stbir__simdf_load2( c, hc ); \ 4793 stbir__simdf_0123to0011( c, c ); \ 4794 stbir__simdf_mult_mem( tot, c, decode ); 4795 4796 #define stbir__3_coeff_only() \ 4797 stbir__simdf tot,c,cs,d; \ 4798 STBIR_SIMD_NO_UNROLL(decode); \ 4799 stbir__simdf_load( cs, hc ); \ 4800 stbir__simdf_0123to0011( c, cs ); \ 4801 stbir__simdf_mult_mem( tot, c, decode ); \ 4802 stbir__simdf_0123to2222( c, cs ); \ 4803 stbir__simdf_load2z( d, decode+4 ); \ 4804 stbir__simdf_madd( tot, tot, d, c ); 4805 4806 #define stbir__store_output_tiny() \ 4807 stbir__simdf_0123to2301( c, tot ); \ 4808 stbir__simdf_add( tot, tot, c ); \ 4809 stbir__simdf_store2( output, tot ); \ 4810 horizontal_coefficients += coefficient_width; \ 4811 ++horizontal_contributors; \ 4812 output += 2; 4813 4814 #ifdef STBIR_SIMD8 4815 4816 #define stbir__4_coeff_start() \ 4817 stbir__simdf8 tot0,c,cs; \ 4818 STBIR_SIMD_NO_UNROLL(decode); \ 4819 stbir__simdf8_load4b( cs, hc ); \ 4820 stbir__simdf8_0123to00112233( c, cs ); \ 4821 stbir__simdf8_mult_mem( tot0, c, decode ); 4822 4823 #define stbir__4_coeff_continue_from_4( ofs ) \ 4824 STBIR_SIMD_NO_UNROLL(decode); \ 4825 stbir__simdf8_load4b( cs, hc + (ofs) ); \ 4826 stbir__simdf8_0123to00112233( c, cs ); \ 4827 stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*2 ); 4828 4829 #define stbir__1_coeff_remnant( ofs ) \ 4830 { stbir__simdf t,d; \ 4831 stbir__simdf_load1z( t, hc + (ofs) ); \ 4832 stbir__simdf_load2( d, decode + (ofs) * 2 ); \ 4833 stbir__simdf_0123to0011( t, t ); \ 4834 stbir__simdf_mult( t, t, d ); \ 4835 stbir__simdf8_add4( tot0, tot0, t ); } 4836 4837 #define stbir__2_coeff_remnant( ofs ) \ 4838 { stbir__simdf t; \ 4839 stbir__simdf_load2( t, hc + (ofs) ); \ 4840 stbir__simdf_0123to0011( t, t ); \ 4841 stbir__simdf_mult_mem( t, t, decode+(ofs)*2 ); \ 4842 stbir__simdf8_add4( tot0, tot0, t ); } 4843 4844 #define stbir__3_coeff_remnant( ofs ) \ 4845 { stbir__simdf8 d; \ 4846 stbir__simdf8_load4b( cs, hc + (ofs) ); \ 4847 stbir__simdf8_0123to00112233( c, cs ); \ 4848 stbir__simdf8_load6z( d, decode+(ofs)*2 ); \ 4849 stbir__simdf8_madd( tot0, tot0, c, d ); } 4850 4851 #define stbir__store_output() \ 4852 { stbir__simdf t,d; \ 4853 stbir__simdf8_add4halves( t, stbir__if_simdf8_cast_to_simdf4(tot0), tot0 ); \ 4854 stbir__simdf_0123to2301( d, t ); \ 4855 stbir__simdf_add( t, t, d ); \ 4856 stbir__simdf_store2( output, t ); \ 4857 horizontal_coefficients += coefficient_width; \ 4858 ++horizontal_contributors; \ 4859 output += 2; } 4860 4861 #else 4862 4863 #define stbir__4_coeff_start() \ 4864 stbir__simdf tot0,tot1,c,cs; \ 4865 STBIR_SIMD_NO_UNROLL(decode); \ 4866 stbir__simdf_load( cs, hc ); \ 4867 stbir__simdf_0123to0011( c, cs ); \ 4868 stbir__simdf_mult_mem( tot0, c, decode ); \ 4869 stbir__simdf_0123to2233( c, cs ); \ 4870 stbir__simdf_mult_mem( tot1, c, decode+4 ); 4871 4872 #define stbir__4_coeff_continue_from_4( ofs ) \ 4873 STBIR_SIMD_NO_UNROLL(decode); \ 4874 stbir__simdf_load( cs, hc + (ofs) ); \ 4875 stbir__simdf_0123to0011( c, cs ); \ 4876 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*2 ); \ 4877 stbir__simdf_0123to2233( c, cs ); \ 4878 stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*2+4 ); 4879 4880 #define stbir__1_coeff_remnant( ofs ) \ 4881 { stbir__simdf d; \ 4882 stbir__simdf_load1z( cs, hc + (ofs) ); \ 4883 stbir__simdf_0123to0011( c, cs ); \ 4884 stbir__simdf_load2( d, decode + (ofs) * 2 ); \ 4885 stbir__simdf_madd( tot0, tot0, d, c ); } 4886 4887 #define stbir__2_coeff_remnant( ofs ) \ 4888 stbir__simdf_load2( cs, hc + (ofs) ); \ 4889 stbir__simdf_0123to0011( c, cs ); \ 4890 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*2 ); 4891 4892 #define stbir__3_coeff_remnant( ofs ) \ 4893 { stbir__simdf d; \ 4894 stbir__simdf_load( cs, hc + (ofs) ); \ 4895 stbir__simdf_0123to0011( c, cs ); \ 4896 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*2 ); \ 4897 stbir__simdf_0123to2222( c, cs ); \ 4898 stbir__simdf_load2z( d, decode + (ofs) * 2 + 4 ); \ 4899 stbir__simdf_madd( tot1, tot1, d, c ); } 4900 4901 #define stbir__store_output() \ 4902 stbir__simdf_add( tot0, tot0, tot1 ); \ 4903 stbir__simdf_0123to2301( c, tot0 ); \ 4904 stbir__simdf_add( tot0, tot0, c ); \ 4905 stbir__simdf_store2( output, tot0 ); \ 4906 horizontal_coefficients += coefficient_width; \ 4907 ++horizontal_contributors; \ 4908 output += 2; 4909 4910 #endif 4911 4912 #else 4913 4914 #define stbir__1_coeff_only() \ 4915 float tota,totb,c; \ 4916 c = hc[0]; \ 4917 tota = decode[0]*c; \ 4918 totb = decode[1]*c; 4919 4920 #define stbir__2_coeff_only() \ 4921 float tota,totb,c; \ 4922 c = hc[0]; \ 4923 tota = decode[0]*c; \ 4924 totb = decode[1]*c; \ 4925 c = hc[1]; \ 4926 tota += decode[2]*c; \ 4927 totb += decode[3]*c; 4928 4929 // this weird order of add matches the simd 4930 #define stbir__3_coeff_only() \ 4931 float tota,totb,c; \ 4932 c = hc[0]; \ 4933 tota = decode[0]*c; \ 4934 totb = decode[1]*c; \ 4935 c = hc[2]; \ 4936 tota += decode[4]*c; \ 4937 totb += decode[5]*c; \ 4938 c = hc[1]; \ 4939 tota += decode[2]*c; \ 4940 totb += decode[3]*c; 4941 4942 #define stbir__store_output_tiny() \ 4943 output[0] = tota; \ 4944 output[1] = totb; \ 4945 horizontal_coefficients += coefficient_width; \ 4946 ++horizontal_contributors; \ 4947 output += 2; 4948 4949 #define stbir__4_coeff_start() \ 4950 float tota0,tota1,tota2,tota3,totb0,totb1,totb2,totb3,c; \ 4951 c = hc[0]; \ 4952 tota0 = decode[0]*c; \ 4953 totb0 = decode[1]*c; \ 4954 c = hc[1]; \ 4955 tota1 = decode[2]*c; \ 4956 totb1 = decode[3]*c; \ 4957 c = hc[2]; \ 4958 tota2 = decode[4]*c; \ 4959 totb2 = decode[5]*c; \ 4960 c = hc[3]; \ 4961 tota3 = decode[6]*c; \ 4962 totb3 = decode[7]*c; 4963 4964 #define stbir__4_coeff_continue_from_4( ofs ) \ 4965 c = hc[0+(ofs)]; \ 4966 tota0 += decode[0+(ofs)*2]*c; \ 4967 totb0 += decode[1+(ofs)*2]*c; \ 4968 c = hc[1+(ofs)]; \ 4969 tota1 += decode[2+(ofs)*2]*c; \ 4970 totb1 += decode[3+(ofs)*2]*c; \ 4971 c = hc[2+(ofs)]; \ 4972 tota2 += decode[4+(ofs)*2]*c; \ 4973 totb2 += decode[5+(ofs)*2]*c; \ 4974 c = hc[3+(ofs)]; \ 4975 tota3 += decode[6+(ofs)*2]*c; \ 4976 totb3 += decode[7+(ofs)*2]*c; 4977 4978 #define stbir__1_coeff_remnant( ofs ) \ 4979 c = hc[0+(ofs)]; \ 4980 tota0 += decode[0+(ofs)*2] * c; \ 4981 totb0 += decode[1+(ofs)*2] * c; 4982 4983 #define stbir__2_coeff_remnant( ofs ) \ 4984 c = hc[0+(ofs)]; \ 4985 tota0 += decode[0+(ofs)*2] * c; \ 4986 totb0 += decode[1+(ofs)*2] * c; \ 4987 c = hc[1+(ofs)]; \ 4988 tota1 += decode[2+(ofs)*2] * c; \ 4989 totb1 += decode[3+(ofs)*2] * c; 4990 4991 #define stbir__3_coeff_remnant( ofs ) \ 4992 c = hc[0+(ofs)]; \ 4993 tota0 += decode[0+(ofs)*2] * c; \ 4994 totb0 += decode[1+(ofs)*2] * c; \ 4995 c = hc[1+(ofs)]; \ 4996 tota1 += decode[2+(ofs)*2] * c; \ 4997 totb1 += decode[3+(ofs)*2] * c; \ 4998 c = hc[2+(ofs)]; \ 4999 tota2 += decode[4+(ofs)*2] * c; \ 5000 totb2 += decode[5+(ofs)*2] * c; 5001 5002 #define stbir__store_output() \ 5003 output[0] = (tota0+tota2)+(tota1+tota3); \ 5004 output[1] = (totb0+totb2)+(totb1+totb3); \ 5005 horizontal_coefficients += coefficient_width; \ 5006 ++horizontal_contributors; \ 5007 output += 2; 5008 5009 #endif 5010 5011 #define STBIR__horizontal_channels 2 5012 #define STB_IMAGE_RESIZE_DO_HORIZONTALS 5013 #include STBIR__HEADER_FILENAME 5014 5015 5016 //================= 5017 // Do 3 channel horizontal routines 5018 5019 #ifdef STBIR_SIMD 5020 5021 #define stbir__1_coeff_only() \ 5022 stbir__simdf tot,c,d; \ 5023 STBIR_SIMD_NO_UNROLL(decode); \ 5024 stbir__simdf_load1z( c, hc ); \ 5025 stbir__simdf_0123to0001( c, c ); \ 5026 stbir__simdf_load( d, decode ); \ 5027 stbir__simdf_mult( tot, d, c ); 5028 5029 #define stbir__2_coeff_only() \ 5030 stbir__simdf tot,c,cs,d; \ 5031 STBIR_SIMD_NO_UNROLL(decode); \ 5032 stbir__simdf_load2( cs, hc ); \ 5033 stbir__simdf_0123to0000( c, cs ); \ 5034 stbir__simdf_load( d, decode ); \ 5035 stbir__simdf_mult( tot, d, c ); \ 5036 stbir__simdf_0123to1111( c, cs ); \ 5037 stbir__simdf_load( d, decode+3 ); \ 5038 stbir__simdf_madd( tot, tot, d, c ); 5039 5040 #define stbir__3_coeff_only() \ 5041 stbir__simdf tot,c,d,cs; \ 5042 STBIR_SIMD_NO_UNROLL(decode); \ 5043 stbir__simdf_load( cs, hc ); \ 5044 stbir__simdf_0123to0000( c, cs ); \ 5045 stbir__simdf_load( d, decode ); \ 5046 stbir__simdf_mult( tot, d, c ); \ 5047 stbir__simdf_0123to1111( c, cs ); \ 5048 stbir__simdf_load( d, decode+3 ); \ 5049 stbir__simdf_madd( tot, tot, d, c ); \ 5050 stbir__simdf_0123to2222( c, cs ); \ 5051 stbir__simdf_load( d, decode+6 ); \ 5052 stbir__simdf_madd( tot, tot, d, c ); 5053 5054 #define stbir__store_output_tiny() \ 5055 stbir__simdf_store2( output, tot ); \ 5056 stbir__simdf_0123to2301( tot, tot ); \ 5057 stbir__simdf_store1( output+2, tot ); \ 5058 horizontal_coefficients += coefficient_width; \ 5059 ++horizontal_contributors; \ 5060 output += 3; 5061 5062 #ifdef STBIR_SIMD8 5063 5064 // we're loading from the XXXYYY decode by -1 to get the XXXYYY into different halves of the AVX reg fyi 5065 #define stbir__4_coeff_start() \ 5066 stbir__simdf8 tot0,tot1,c,cs; stbir__simdf t; \ 5067 STBIR_SIMD_NO_UNROLL(decode); \ 5068 stbir__simdf8_load4b( cs, hc ); \ 5069 stbir__simdf8_0123to00001111( c, cs ); \ 5070 stbir__simdf8_mult_mem( tot0, c, decode - 1 ); \ 5071 stbir__simdf8_0123to22223333( c, cs ); \ 5072 stbir__simdf8_mult_mem( tot1, c, decode+6 - 1 ); 5073 5074 #define stbir__4_coeff_continue_from_4( ofs ) \ 5075 STBIR_SIMD_NO_UNROLL(decode); \ 5076 stbir__simdf8_load4b( cs, hc + (ofs) ); \ 5077 stbir__simdf8_0123to00001111( c, cs ); \ 5078 stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*3 - 1 ); \ 5079 stbir__simdf8_0123to22223333( c, cs ); \ 5080 stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*3 + 6 - 1 ); 5081 5082 #define stbir__1_coeff_remnant( ofs ) \ 5083 STBIR_SIMD_NO_UNROLL(decode); \ 5084 stbir__simdf_load1rep4( t, hc + (ofs) ); \ 5085 stbir__simdf8_madd_mem4( tot0, tot0, t, decode+(ofs)*3 - 1 ); 5086 5087 #define stbir__2_coeff_remnant( ofs ) \ 5088 STBIR_SIMD_NO_UNROLL(decode); \ 5089 stbir__simdf8_load4b( cs, hc + (ofs) - 2 ); \ 5090 stbir__simdf8_0123to22223333( c, cs ); \ 5091 stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*3 - 1 ); 5092 5093 #define stbir__3_coeff_remnant( ofs ) \ 5094 STBIR_SIMD_NO_UNROLL(decode); \ 5095 stbir__simdf8_load4b( cs, hc + (ofs) ); \ 5096 stbir__simdf8_0123to00001111( c, cs ); \ 5097 stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*3 - 1 ); \ 5098 stbir__simdf8_0123to2222( t, cs ); \ 5099 stbir__simdf8_madd_mem4( tot1, tot1, t, decode+(ofs)*3 + 6 - 1 ); 5100 5101 #define stbir__store_output() \ 5102 stbir__simdf8_add( tot0, tot0, tot1 ); \ 5103 stbir__simdf_0123to1230( t, stbir__if_simdf8_cast_to_simdf4( tot0 ) ); \ 5104 stbir__simdf8_add4halves( t, t, tot0 ); \ 5105 horizontal_coefficients += coefficient_width; \ 5106 ++horizontal_contributors; \ 5107 output += 3; \ 5108 if ( output < output_end ) \ 5109 { \ 5110 stbir__simdf_store( output-3, t ); \ 5111 continue; \ 5112 } \ 5113 { stbir__simdf tt; stbir__simdf_0123to2301( tt, t ); \ 5114 stbir__simdf_store2( output-3, t ); \ 5115 stbir__simdf_store1( output+2-3, tt ); } \ 5116 break; 5117 5118 5119 #else 5120 5121 #define stbir__4_coeff_start() \ 5122 stbir__simdf tot0,tot1,tot2,c,cs; \ 5123 STBIR_SIMD_NO_UNROLL(decode); \ 5124 stbir__simdf_load( cs, hc ); \ 5125 stbir__simdf_0123to0001( c, cs ); \ 5126 stbir__simdf_mult_mem( tot0, c, decode ); \ 5127 stbir__simdf_0123to1122( c, cs ); \ 5128 stbir__simdf_mult_mem( tot1, c, decode+4 ); \ 5129 stbir__simdf_0123to2333( c, cs ); \ 5130 stbir__simdf_mult_mem( tot2, c, decode+8 ); 5131 5132 #define stbir__4_coeff_continue_from_4( ofs ) \ 5133 STBIR_SIMD_NO_UNROLL(decode); \ 5134 stbir__simdf_load( cs, hc + (ofs) ); \ 5135 stbir__simdf_0123to0001( c, cs ); \ 5136 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*3 ); \ 5137 stbir__simdf_0123to1122( c, cs ); \ 5138 stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*3+4 ); \ 5139 stbir__simdf_0123to2333( c, cs ); \ 5140 stbir__simdf_madd_mem( tot2, tot2, c, decode+(ofs)*3+8 ); 5141 5142 #define stbir__1_coeff_remnant( ofs ) \ 5143 STBIR_SIMD_NO_UNROLL(decode); \ 5144 stbir__simdf_load1z( c, hc + (ofs) ); \ 5145 stbir__simdf_0123to0001( c, c ); \ 5146 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*3 ); 5147 5148 #define stbir__2_coeff_remnant( ofs ) \ 5149 { stbir__simdf d; \ 5150 STBIR_SIMD_NO_UNROLL(decode); \ 5151 stbir__simdf_load2z( cs, hc + (ofs) ); \ 5152 stbir__simdf_0123to0001( c, cs ); \ 5153 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*3 ); \ 5154 stbir__simdf_0123to1122( c, cs ); \ 5155 stbir__simdf_load2z( d, decode+(ofs)*3+4 ); \ 5156 stbir__simdf_madd( tot1, tot1, c, d ); } 5157 5158 #define stbir__3_coeff_remnant( ofs ) \ 5159 { stbir__simdf d; \ 5160 STBIR_SIMD_NO_UNROLL(decode); \ 5161 stbir__simdf_load( cs, hc + (ofs) ); \ 5162 stbir__simdf_0123to0001( c, cs ); \ 5163 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*3 ); \ 5164 stbir__simdf_0123to1122( c, cs ); \ 5165 stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*3+4 ); \ 5166 stbir__simdf_0123to2222( c, cs ); \ 5167 stbir__simdf_load1z( d, decode+(ofs)*3+8 ); \ 5168 stbir__simdf_madd( tot2, tot2, c, d ); } 5169 5170 #define stbir__store_output() \ 5171 stbir__simdf_0123ABCDto3ABx( c, tot0, tot1 ); \ 5172 stbir__simdf_0123ABCDto23Ax( cs, tot1, tot2 ); \ 5173 stbir__simdf_0123to1230( tot2, tot2 ); \ 5174 stbir__simdf_add( tot0, tot0, cs ); \ 5175 stbir__simdf_add( c, c, tot2 ); \ 5176 stbir__simdf_add( tot0, tot0, c ); \ 5177 horizontal_coefficients += coefficient_width; \ 5178 ++horizontal_contributors; \ 5179 output += 3; \ 5180 if ( output < output_end ) \ 5181 { \ 5182 stbir__simdf_store( output-3, tot0 ); \ 5183 continue; \ 5184 } \ 5185 stbir__simdf_0123to2301( tot1, tot0 ); \ 5186 stbir__simdf_store2( output-3, tot0 ); \ 5187 stbir__simdf_store1( output+2-3, tot1 ); \ 5188 break; 5189 5190 #endif 5191 5192 #else 5193 5194 #define stbir__1_coeff_only() \ 5195 float tot0, tot1, tot2, c; \ 5196 c = hc[0]; \ 5197 tot0 = decode[0]*c; \ 5198 tot1 = decode[1]*c; \ 5199 tot2 = decode[2]*c; 5200 5201 #define stbir__2_coeff_only() \ 5202 float tot0, tot1, tot2, c; \ 5203 c = hc[0]; \ 5204 tot0 = decode[0]*c; \ 5205 tot1 = decode[1]*c; \ 5206 tot2 = decode[2]*c; \ 5207 c = hc[1]; \ 5208 tot0 += decode[3]*c; \ 5209 tot1 += decode[4]*c; \ 5210 tot2 += decode[5]*c; 5211 5212 #define stbir__3_coeff_only() \ 5213 float tot0, tot1, tot2, c; \ 5214 c = hc[0]; \ 5215 tot0 = decode[0]*c; \ 5216 tot1 = decode[1]*c; \ 5217 tot2 = decode[2]*c; \ 5218 c = hc[1]; \ 5219 tot0 += decode[3]*c; \ 5220 tot1 += decode[4]*c; \ 5221 tot2 += decode[5]*c; \ 5222 c = hc[2]; \ 5223 tot0 += decode[6]*c; \ 5224 tot1 += decode[7]*c; \ 5225 tot2 += decode[8]*c; 5226 5227 #define stbir__store_output_tiny() \ 5228 output[0] = tot0; \ 5229 output[1] = tot1; \ 5230 output[2] = tot2; \ 5231 horizontal_coefficients += coefficient_width; \ 5232 ++horizontal_contributors; \ 5233 output += 3; 5234 5235 #define stbir__4_coeff_start() \ 5236 float tota0,tota1,tota2,totb0,totb1,totb2,totc0,totc1,totc2,totd0,totd1,totd2,c; \ 5237 c = hc[0]; \ 5238 tota0 = decode[0]*c; \ 5239 tota1 = decode[1]*c; \ 5240 tota2 = decode[2]*c; \ 5241 c = hc[1]; \ 5242 totb0 = decode[3]*c; \ 5243 totb1 = decode[4]*c; \ 5244 totb2 = decode[5]*c; \ 5245 c = hc[2]; \ 5246 totc0 = decode[6]*c; \ 5247 totc1 = decode[7]*c; \ 5248 totc2 = decode[8]*c; \ 5249 c = hc[3]; \ 5250 totd0 = decode[9]*c; \ 5251 totd1 = decode[10]*c; \ 5252 totd2 = decode[11]*c; 5253 5254 #define stbir__4_coeff_continue_from_4( ofs ) \ 5255 c = hc[0+(ofs)]; \ 5256 tota0 += decode[0+(ofs)*3]*c; \ 5257 tota1 += decode[1+(ofs)*3]*c; \ 5258 tota2 += decode[2+(ofs)*3]*c; \ 5259 c = hc[1+(ofs)]; \ 5260 totb0 += decode[3+(ofs)*3]*c; \ 5261 totb1 += decode[4+(ofs)*3]*c; \ 5262 totb2 += decode[5+(ofs)*3]*c; \ 5263 c = hc[2+(ofs)]; \ 5264 totc0 += decode[6+(ofs)*3]*c; \ 5265 totc1 += decode[7+(ofs)*3]*c; \ 5266 totc2 += decode[8+(ofs)*3]*c; \ 5267 c = hc[3+(ofs)]; \ 5268 totd0 += decode[9+(ofs)*3]*c; \ 5269 totd1 += decode[10+(ofs)*3]*c; \ 5270 totd2 += decode[11+(ofs)*3]*c; 5271 5272 #define stbir__1_coeff_remnant( ofs ) \ 5273 c = hc[0+(ofs)]; \ 5274 tota0 += decode[0+(ofs)*3]*c; \ 5275 tota1 += decode[1+(ofs)*3]*c; \ 5276 tota2 += decode[2+(ofs)*3]*c; 5277 5278 #define stbir__2_coeff_remnant( ofs ) \ 5279 c = hc[0+(ofs)]; \ 5280 tota0 += decode[0+(ofs)*3]*c; \ 5281 tota1 += decode[1+(ofs)*3]*c; \ 5282 tota2 += decode[2+(ofs)*3]*c; \ 5283 c = hc[1+(ofs)]; \ 5284 totb0 += decode[3+(ofs)*3]*c; \ 5285 totb1 += decode[4+(ofs)*3]*c; \ 5286 totb2 += decode[5+(ofs)*3]*c; \ 5287 5288 #define stbir__3_coeff_remnant( ofs ) \ 5289 c = hc[0+(ofs)]; \ 5290 tota0 += decode[0+(ofs)*3]*c; \ 5291 tota1 += decode[1+(ofs)*3]*c; \ 5292 tota2 += decode[2+(ofs)*3]*c; \ 5293 c = hc[1+(ofs)]; \ 5294 totb0 += decode[3+(ofs)*3]*c; \ 5295 totb1 += decode[4+(ofs)*3]*c; \ 5296 totb2 += decode[5+(ofs)*3]*c; \ 5297 c = hc[2+(ofs)]; \ 5298 totc0 += decode[6+(ofs)*3]*c; \ 5299 totc1 += decode[7+(ofs)*3]*c; \ 5300 totc2 += decode[8+(ofs)*3]*c; 5301 5302 #define stbir__store_output() \ 5303 output[0] = (tota0+totc0)+(totb0+totd0); \ 5304 output[1] = (tota1+totc1)+(totb1+totd1); \ 5305 output[2] = (tota2+totc2)+(totb2+totd2); \ 5306 horizontal_coefficients += coefficient_width; \ 5307 ++horizontal_contributors; \ 5308 output += 3; 5309 5310 #endif 5311 5312 #define STBIR__horizontal_channels 3 5313 #define STB_IMAGE_RESIZE_DO_HORIZONTALS 5314 #include STBIR__HEADER_FILENAME 5315 5316 //================= 5317 // Do 4 channel horizontal routines 5318 5319 #ifdef STBIR_SIMD 5320 5321 #define stbir__1_coeff_only() \ 5322 stbir__simdf tot,c; \ 5323 STBIR_SIMD_NO_UNROLL(decode); \ 5324 stbir__simdf_load1( c, hc ); \ 5325 stbir__simdf_0123to0000( c, c ); \ 5326 stbir__simdf_mult_mem( tot, c, decode ); 5327 5328 #define stbir__2_coeff_only() \ 5329 stbir__simdf tot,c,cs; \ 5330 STBIR_SIMD_NO_UNROLL(decode); \ 5331 stbir__simdf_load2( cs, hc ); \ 5332 stbir__simdf_0123to0000( c, cs ); \ 5333 stbir__simdf_mult_mem( tot, c, decode ); \ 5334 stbir__simdf_0123to1111( c, cs ); \ 5335 stbir__simdf_madd_mem( tot, tot, c, decode+4 ); 5336 5337 #define stbir__3_coeff_only() \ 5338 stbir__simdf tot,c,cs; \ 5339 STBIR_SIMD_NO_UNROLL(decode); \ 5340 stbir__simdf_load( cs, hc ); \ 5341 stbir__simdf_0123to0000( c, cs ); \ 5342 stbir__simdf_mult_mem( tot, c, decode ); \ 5343 stbir__simdf_0123to1111( c, cs ); \ 5344 stbir__simdf_madd_mem( tot, tot, c, decode+4 ); \ 5345 stbir__simdf_0123to2222( c, cs ); \ 5346 stbir__simdf_madd_mem( tot, tot, c, decode+8 ); 5347 5348 #define stbir__store_output_tiny() \ 5349 stbir__simdf_store( output, tot ); \ 5350 horizontal_coefficients += coefficient_width; \ 5351 ++horizontal_contributors; \ 5352 output += 4; 5353 5354 #ifdef STBIR_SIMD8 5355 5356 #define stbir__4_coeff_start() \ 5357 stbir__simdf8 tot0,c,cs; stbir__simdf t; \ 5358 STBIR_SIMD_NO_UNROLL(decode); \ 5359 stbir__simdf8_load4b( cs, hc ); \ 5360 stbir__simdf8_0123to00001111( c, cs ); \ 5361 stbir__simdf8_mult_mem( tot0, c, decode ); \ 5362 stbir__simdf8_0123to22223333( c, cs ); \ 5363 stbir__simdf8_madd_mem( tot0, tot0, c, decode+8 ); 5364 5365 #define stbir__4_coeff_continue_from_4( ofs ) \ 5366 STBIR_SIMD_NO_UNROLL(decode); \ 5367 stbir__simdf8_load4b( cs, hc + (ofs) ); \ 5368 stbir__simdf8_0123to00001111( c, cs ); \ 5369 stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*4 ); \ 5370 stbir__simdf8_0123to22223333( c, cs ); \ 5371 stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*4+8 ); 5372 5373 #define stbir__1_coeff_remnant( ofs ) \ 5374 STBIR_SIMD_NO_UNROLL(decode); \ 5375 stbir__simdf_load1rep4( t, hc + (ofs) ); \ 5376 stbir__simdf8_madd_mem4( tot0, tot0, t, decode+(ofs)*4 ); 5377 5378 #define stbir__2_coeff_remnant( ofs ) \ 5379 STBIR_SIMD_NO_UNROLL(decode); \ 5380 stbir__simdf8_load4b( cs, hc + (ofs) - 2 ); \ 5381 stbir__simdf8_0123to22223333( c, cs ); \ 5382 stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*4 ); 5383 5384 #define stbir__3_coeff_remnant( ofs ) \ 5385 STBIR_SIMD_NO_UNROLL(decode); \ 5386 stbir__simdf8_load4b( cs, hc + (ofs) ); \ 5387 stbir__simdf8_0123to00001111( c, cs ); \ 5388 stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*4 ); \ 5389 stbir__simdf8_0123to2222( t, cs ); \ 5390 stbir__simdf8_madd_mem4( tot0, tot0, t, decode+(ofs)*4+8 ); 5391 5392 #define stbir__store_output() \ 5393 stbir__simdf8_add4halves( t, stbir__if_simdf8_cast_to_simdf4(tot0), tot0 ); \ 5394 stbir__simdf_store( output, t ); \ 5395 horizontal_coefficients += coefficient_width; \ 5396 ++horizontal_contributors; \ 5397 output += 4; 5398 5399 #else 5400 5401 #define stbir__4_coeff_start() \ 5402 stbir__simdf tot0,tot1,c,cs; \ 5403 STBIR_SIMD_NO_UNROLL(decode); \ 5404 stbir__simdf_load( cs, hc ); \ 5405 stbir__simdf_0123to0000( c, cs ); \ 5406 stbir__simdf_mult_mem( tot0, c, decode ); \ 5407 stbir__simdf_0123to1111( c, cs ); \ 5408 stbir__simdf_mult_mem( tot1, c, decode+4 ); \ 5409 stbir__simdf_0123to2222( c, cs ); \ 5410 stbir__simdf_madd_mem( tot0, tot0, c, decode+8 ); \ 5411 stbir__simdf_0123to3333( c, cs ); \ 5412 stbir__simdf_madd_mem( tot1, tot1, c, decode+12 ); 5413 5414 #define stbir__4_coeff_continue_from_4( ofs ) \ 5415 STBIR_SIMD_NO_UNROLL(decode); \ 5416 stbir__simdf_load( cs, hc + (ofs) ); \ 5417 stbir__simdf_0123to0000( c, cs ); \ 5418 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4 ); \ 5419 stbir__simdf_0123to1111( c, cs ); \ 5420 stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*4+4 ); \ 5421 stbir__simdf_0123to2222( c, cs ); \ 5422 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4+8 ); \ 5423 stbir__simdf_0123to3333( c, cs ); \ 5424 stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*4+12 ); 5425 5426 #define stbir__1_coeff_remnant( ofs ) \ 5427 STBIR_SIMD_NO_UNROLL(decode); \ 5428 stbir__simdf_load1( c, hc + (ofs) ); \ 5429 stbir__simdf_0123to0000( c, c ); \ 5430 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4 ); 5431 5432 #define stbir__2_coeff_remnant( ofs ) \ 5433 STBIR_SIMD_NO_UNROLL(decode); \ 5434 stbir__simdf_load2( cs, hc + (ofs) ); \ 5435 stbir__simdf_0123to0000( c, cs ); \ 5436 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4 ); \ 5437 stbir__simdf_0123to1111( c, cs ); \ 5438 stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*4+4 ); 5439 5440 #define stbir__3_coeff_remnant( ofs ) \ 5441 STBIR_SIMD_NO_UNROLL(decode); \ 5442 stbir__simdf_load( cs, hc + (ofs) ); \ 5443 stbir__simdf_0123to0000( c, cs ); \ 5444 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4 ); \ 5445 stbir__simdf_0123to1111( c, cs ); \ 5446 stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*4+4 ); \ 5447 stbir__simdf_0123to2222( c, cs ); \ 5448 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4+8 ); 5449 5450 #define stbir__store_output() \ 5451 stbir__simdf_add( tot0, tot0, tot1 ); \ 5452 stbir__simdf_store( output, tot0 ); \ 5453 horizontal_coefficients += coefficient_width; \ 5454 ++horizontal_contributors; \ 5455 output += 4; 5456 5457 #endif 5458 5459 #else 5460 5461 #define stbir__1_coeff_only() \ 5462 float p0,p1,p2,p3,c; \ 5463 STBIR_SIMD_NO_UNROLL(decode); \ 5464 c = hc[0]; \ 5465 p0 = decode[0] * c; \ 5466 p1 = decode[1] * c; \ 5467 p2 = decode[2] * c; \ 5468 p3 = decode[3] * c; 5469 5470 #define stbir__2_coeff_only() \ 5471 float p0,p1,p2,p3,c; \ 5472 STBIR_SIMD_NO_UNROLL(decode); \ 5473 c = hc[0]; \ 5474 p0 = decode[0] * c; \ 5475 p1 = decode[1] * c; \ 5476 p2 = decode[2] * c; \ 5477 p3 = decode[3] * c; \ 5478 c = hc[1]; \ 5479 p0 += decode[4] * c; \ 5480 p1 += decode[5] * c; \ 5481 p2 += decode[6] * c; \ 5482 p3 += decode[7] * c; 5483 5484 #define stbir__3_coeff_only() \ 5485 float p0,p1,p2,p3,c; \ 5486 STBIR_SIMD_NO_UNROLL(decode); \ 5487 c = hc[0]; \ 5488 p0 = decode[0] * c; \ 5489 p1 = decode[1] * c; \ 5490 p2 = decode[2] * c; \ 5491 p3 = decode[3] * c; \ 5492 c = hc[1]; \ 5493 p0 += decode[4] * c; \ 5494 p1 += decode[5] * c; \ 5495 p2 += decode[6] * c; \ 5496 p3 += decode[7] * c; \ 5497 c = hc[2]; \ 5498 p0 += decode[8] * c; \ 5499 p1 += decode[9] * c; \ 5500 p2 += decode[10] * c; \ 5501 p3 += decode[11] * c; 5502 5503 #define stbir__store_output_tiny() \ 5504 output[0] = p0; \ 5505 output[1] = p1; \ 5506 output[2] = p2; \ 5507 output[3] = p3; \ 5508 horizontal_coefficients += coefficient_width; \ 5509 ++horizontal_contributors; \ 5510 output += 4; 5511 5512 #define stbir__4_coeff_start() \ 5513 float x0,x1,x2,x3,y0,y1,y2,y3,c; \ 5514 STBIR_SIMD_NO_UNROLL(decode); \ 5515 c = hc[0]; \ 5516 x0 = decode[0] * c; \ 5517 x1 = decode[1] * c; \ 5518 x2 = decode[2] * c; \ 5519 x3 = decode[3] * c; \ 5520 c = hc[1]; \ 5521 y0 = decode[4] * c; \ 5522 y1 = decode[5] * c; \ 5523 y2 = decode[6] * c; \ 5524 y3 = decode[7] * c; \ 5525 c = hc[2]; \ 5526 x0 += decode[8] * c; \ 5527 x1 += decode[9] * c; \ 5528 x2 += decode[10] * c; \ 5529 x3 += decode[11] * c; \ 5530 c = hc[3]; \ 5531 y0 += decode[12] * c; \ 5532 y1 += decode[13] * c; \ 5533 y2 += decode[14] * c; \ 5534 y3 += decode[15] * c; 5535 5536 #define stbir__4_coeff_continue_from_4( ofs ) \ 5537 STBIR_SIMD_NO_UNROLL(decode); \ 5538 c = hc[0+(ofs)]; \ 5539 x0 += decode[0+(ofs)*4] * c; \ 5540 x1 += decode[1+(ofs)*4] * c; \ 5541 x2 += decode[2+(ofs)*4] * c; \ 5542 x3 += decode[3+(ofs)*4] * c; \ 5543 c = hc[1+(ofs)]; \ 5544 y0 += decode[4+(ofs)*4] * c; \ 5545 y1 += decode[5+(ofs)*4] * c; \ 5546 y2 += decode[6+(ofs)*4] * c; \ 5547 y3 += decode[7+(ofs)*4] * c; \ 5548 c = hc[2+(ofs)]; \ 5549 x0 += decode[8+(ofs)*4] * c; \ 5550 x1 += decode[9+(ofs)*4] * c; \ 5551 x2 += decode[10+(ofs)*4] * c; \ 5552 x3 += decode[11+(ofs)*4] * c; \ 5553 c = hc[3+(ofs)]; \ 5554 y0 += decode[12+(ofs)*4] * c; \ 5555 y1 += decode[13+(ofs)*4] * c; \ 5556 y2 += decode[14+(ofs)*4] * c; \ 5557 y3 += decode[15+(ofs)*4] * c; 5558 5559 #define stbir__1_coeff_remnant( ofs ) \ 5560 STBIR_SIMD_NO_UNROLL(decode); \ 5561 c = hc[0+(ofs)]; \ 5562 x0 += decode[0+(ofs)*4] * c; \ 5563 x1 += decode[1+(ofs)*4] * c; \ 5564 x2 += decode[2+(ofs)*4] * c; \ 5565 x3 += decode[3+(ofs)*4] * c; 5566 5567 #define stbir__2_coeff_remnant( ofs ) \ 5568 STBIR_SIMD_NO_UNROLL(decode); \ 5569 c = hc[0+(ofs)]; \ 5570 x0 += decode[0+(ofs)*4] * c; \ 5571 x1 += decode[1+(ofs)*4] * c; \ 5572 x2 += decode[2+(ofs)*4] * c; \ 5573 x3 += decode[3+(ofs)*4] * c; \ 5574 c = hc[1+(ofs)]; \ 5575 y0 += decode[4+(ofs)*4] * c; \ 5576 y1 += decode[5+(ofs)*4] * c; \ 5577 y2 += decode[6+(ofs)*4] * c; \ 5578 y3 += decode[7+(ofs)*4] * c; 5579 5580 #define stbir__3_coeff_remnant( ofs ) \ 5581 STBIR_SIMD_NO_UNROLL(decode); \ 5582 c = hc[0+(ofs)]; \ 5583 x0 += decode[0+(ofs)*4] * c; \ 5584 x1 += decode[1+(ofs)*4] * c; \ 5585 x2 += decode[2+(ofs)*4] * c; \ 5586 x3 += decode[3+(ofs)*4] * c; \ 5587 c = hc[1+(ofs)]; \ 5588 y0 += decode[4+(ofs)*4] * c; \ 5589 y1 += decode[5+(ofs)*4] * c; \ 5590 y2 += decode[6+(ofs)*4] * c; \ 5591 y3 += decode[7+(ofs)*4] * c; \ 5592 c = hc[2+(ofs)]; \ 5593 x0 += decode[8+(ofs)*4] * c; \ 5594 x1 += decode[9+(ofs)*4] * c; \ 5595 x2 += decode[10+(ofs)*4] * c; \ 5596 x3 += decode[11+(ofs)*4] * c; 5597 5598 #define stbir__store_output() \ 5599 output[0] = x0 + y0; \ 5600 output[1] = x1 + y1; \ 5601 output[2] = x2 + y2; \ 5602 output[3] = x3 + y3; \ 5603 horizontal_coefficients += coefficient_width; \ 5604 ++horizontal_contributors; \ 5605 output += 4; 5606 5607 #endif 5608 5609 #define STBIR__horizontal_channels 4 5610 #define STB_IMAGE_RESIZE_DO_HORIZONTALS 5611 #include STBIR__HEADER_FILENAME 5612 5613 5614 5615 //================= 5616 // Do 7 channel horizontal routines 5617 5618 #ifdef STBIR_SIMD 5619 5620 #define stbir__1_coeff_only() \ 5621 stbir__simdf tot0,tot1,c; \ 5622 STBIR_SIMD_NO_UNROLL(decode); \ 5623 stbir__simdf_load1( c, hc ); \ 5624 stbir__simdf_0123to0000( c, c ); \ 5625 stbir__simdf_mult_mem( tot0, c, decode ); \ 5626 stbir__simdf_mult_mem( tot1, c, decode+3 ); 5627 5628 #define stbir__2_coeff_only() \ 5629 stbir__simdf tot0,tot1,c,cs; \ 5630 STBIR_SIMD_NO_UNROLL(decode); \ 5631 stbir__simdf_load2( cs, hc ); \ 5632 stbir__simdf_0123to0000( c, cs ); \ 5633 stbir__simdf_mult_mem( tot0, c, decode ); \ 5634 stbir__simdf_mult_mem( tot1, c, decode+3 ); \ 5635 stbir__simdf_0123to1111( c, cs ); \ 5636 stbir__simdf_madd_mem( tot0, tot0, c, decode+7 ); \ 5637 stbir__simdf_madd_mem( tot1, tot1, c,decode+10 ); 5638 5639 #define stbir__3_coeff_only() \ 5640 stbir__simdf tot0,tot1,c,cs; \ 5641 STBIR_SIMD_NO_UNROLL(decode); \ 5642 stbir__simdf_load( cs, hc ); \ 5643 stbir__simdf_0123to0000( c, cs ); \ 5644 stbir__simdf_mult_mem( tot0, c, decode ); \ 5645 stbir__simdf_mult_mem( tot1, c, decode+3 ); \ 5646 stbir__simdf_0123to1111( c, cs ); \ 5647 stbir__simdf_madd_mem( tot0, tot0, c, decode+7 ); \ 5648 stbir__simdf_madd_mem( tot1, tot1, c, decode+10 ); \ 5649 stbir__simdf_0123to2222( c, cs ); \ 5650 stbir__simdf_madd_mem( tot0, tot0, c, decode+14 ); \ 5651 stbir__simdf_madd_mem( tot1, tot1, c, decode+17 ); 5652 5653 #define stbir__store_output_tiny() \ 5654 stbir__simdf_store( output+3, tot1 ); \ 5655 stbir__simdf_store( output, tot0 ); \ 5656 horizontal_coefficients += coefficient_width; \ 5657 ++horizontal_contributors; \ 5658 output += 7; 5659 5660 #ifdef STBIR_SIMD8 5661 5662 #define stbir__4_coeff_start() \ 5663 stbir__simdf8 tot0,tot1,c,cs; \ 5664 STBIR_SIMD_NO_UNROLL(decode); \ 5665 stbir__simdf8_load4b( cs, hc ); \ 5666 stbir__simdf8_0123to00000000( c, cs ); \ 5667 stbir__simdf8_mult_mem( tot0, c, decode ); \ 5668 stbir__simdf8_0123to11111111( c, cs ); \ 5669 stbir__simdf8_mult_mem( tot1, c, decode+7 ); \ 5670 stbir__simdf8_0123to22222222( c, cs ); \ 5671 stbir__simdf8_madd_mem( tot0, tot0, c, decode+14 ); \ 5672 stbir__simdf8_0123to33333333( c, cs ); \ 5673 stbir__simdf8_madd_mem( tot1, tot1, c, decode+21 ); 5674 5675 #define stbir__4_coeff_continue_from_4( ofs ) \ 5676 STBIR_SIMD_NO_UNROLL(decode); \ 5677 stbir__simdf8_load4b( cs, hc + (ofs) ); \ 5678 stbir__simdf8_0123to00000000( c, cs ); \ 5679 stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7 ); \ 5680 stbir__simdf8_0123to11111111( c, cs ); \ 5681 stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*7+7 ); \ 5682 stbir__simdf8_0123to22222222( c, cs ); \ 5683 stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7+14 ); \ 5684 stbir__simdf8_0123to33333333( c, cs ); \ 5685 stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*7+21 ); 5686 5687 #define stbir__1_coeff_remnant( ofs ) \ 5688 STBIR_SIMD_NO_UNROLL(decode); \ 5689 stbir__simdf8_load1b( c, hc + (ofs) ); \ 5690 stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7 ); 5691 5692 #define stbir__2_coeff_remnant( ofs ) \ 5693 STBIR_SIMD_NO_UNROLL(decode); \ 5694 stbir__simdf8_load1b( c, hc + (ofs) ); \ 5695 stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7 ); \ 5696 stbir__simdf8_load1b( c, hc + (ofs)+1 ); \ 5697 stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*7+7 ); 5698 5699 #define stbir__3_coeff_remnant( ofs ) \ 5700 STBIR_SIMD_NO_UNROLL(decode); \ 5701 stbir__simdf8_load4b( cs, hc + (ofs) ); \ 5702 stbir__simdf8_0123to00000000( c, cs ); \ 5703 stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7 ); \ 5704 stbir__simdf8_0123to11111111( c, cs ); \ 5705 stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*7+7 ); \ 5706 stbir__simdf8_0123to22222222( c, cs ); \ 5707 stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7+14 ); 5708 5709 #define stbir__store_output() \ 5710 stbir__simdf8_add( tot0, tot0, tot1 ); \ 5711 horizontal_coefficients += coefficient_width; \ 5712 ++horizontal_contributors; \ 5713 output += 7; \ 5714 if ( output < output_end ) \ 5715 { \ 5716 stbir__simdf8_store( output-7, tot0 ); \ 5717 continue; \ 5718 } \ 5719 stbir__simdf_store( output-7+3, stbir__simdf_swiz(stbir__simdf8_gettop4(tot0),0,0,1,2) ); \ 5720 stbir__simdf_store( output-7, stbir__if_simdf8_cast_to_simdf4(tot0) ); \ 5721 break; 5722 5723 #else 5724 5725 #define stbir__4_coeff_start() \ 5726 stbir__simdf tot0,tot1,tot2,tot3,c,cs; \ 5727 STBIR_SIMD_NO_UNROLL(decode); \ 5728 stbir__simdf_load( cs, hc ); \ 5729 stbir__simdf_0123to0000( c, cs ); \ 5730 stbir__simdf_mult_mem( tot0, c, decode ); \ 5731 stbir__simdf_mult_mem( tot1, c, decode+3 ); \ 5732 stbir__simdf_0123to1111( c, cs ); \ 5733 stbir__simdf_mult_mem( tot2, c, decode+7 ); \ 5734 stbir__simdf_mult_mem( tot3, c, decode+10 ); \ 5735 stbir__simdf_0123to2222( c, cs ); \ 5736 stbir__simdf_madd_mem( tot0, tot0, c, decode+14 ); \ 5737 stbir__simdf_madd_mem( tot1, tot1, c, decode+17 ); \ 5738 stbir__simdf_0123to3333( c, cs ); \ 5739 stbir__simdf_madd_mem( tot2, tot2, c, decode+21 ); \ 5740 stbir__simdf_madd_mem( tot3, tot3, c, decode+24 ); 5741 5742 #define stbir__4_coeff_continue_from_4( ofs ) \ 5743 STBIR_SIMD_NO_UNROLL(decode); \ 5744 stbir__simdf_load( cs, hc + (ofs) ); \ 5745 stbir__simdf_0123to0000( c, cs ); \ 5746 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*7 ); \ 5747 stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+3 ); \ 5748 stbir__simdf_0123to1111( c, cs ); \ 5749 stbir__simdf_madd_mem( tot2, tot2, c, decode+(ofs)*7+7 ); \ 5750 stbir__simdf_madd_mem( tot3, tot3, c, decode+(ofs)*7+10 ); \ 5751 stbir__simdf_0123to2222( c, cs ); \ 5752 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*7+14 ); \ 5753 stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+17 ); \ 5754 stbir__simdf_0123to3333( c, cs ); \ 5755 stbir__simdf_madd_mem( tot2, tot2, c, decode+(ofs)*7+21 ); \ 5756 stbir__simdf_madd_mem( tot3, tot3, c, decode+(ofs)*7+24 ); 5757 5758 #define stbir__1_coeff_remnant( ofs ) \ 5759 STBIR_SIMD_NO_UNROLL(decode); \ 5760 stbir__simdf_load1( c, hc + (ofs) ); \ 5761 stbir__simdf_0123to0000( c, c ); \ 5762 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*7 ); \ 5763 stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+3 ); \ 5764 5765 #define stbir__2_coeff_remnant( ofs ) \ 5766 STBIR_SIMD_NO_UNROLL(decode); \ 5767 stbir__simdf_load2( cs, hc + (ofs) ); \ 5768 stbir__simdf_0123to0000( c, cs ); \ 5769 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*7 ); \ 5770 stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+3 ); \ 5771 stbir__simdf_0123to1111( c, cs ); \ 5772 stbir__simdf_madd_mem( tot2, tot2, c, decode+(ofs)*7+7 ); \ 5773 stbir__simdf_madd_mem( tot3, tot3, c, decode+(ofs)*7+10 ); 5774 5775 #define stbir__3_coeff_remnant( ofs ) \ 5776 STBIR_SIMD_NO_UNROLL(decode); \ 5777 stbir__simdf_load( cs, hc + (ofs) ); \ 5778 stbir__simdf_0123to0000( c, cs ); \ 5779 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*7 ); \ 5780 stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+3 ); \ 5781 stbir__simdf_0123to1111( c, cs ); \ 5782 stbir__simdf_madd_mem( tot2, tot2, c, decode+(ofs)*7+7 ); \ 5783 stbir__simdf_madd_mem( tot3, tot3, c, decode+(ofs)*7+10 ); \ 5784 stbir__simdf_0123to2222( c, cs ); \ 5785 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*7+14 ); \ 5786 stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+17 ); 5787 5788 #define stbir__store_output() \ 5789 stbir__simdf_add( tot0, tot0, tot2 ); \ 5790 stbir__simdf_add( tot1, tot1, tot3 ); \ 5791 stbir__simdf_store( output+3, tot1 ); \ 5792 stbir__simdf_store( output, tot0 ); \ 5793 horizontal_coefficients += coefficient_width; \ 5794 ++horizontal_contributors; \ 5795 output += 7; 5796 5797 #endif 5798 5799 #else 5800 5801 #define stbir__1_coeff_only() \ 5802 float tot0, tot1, tot2, tot3, tot4, tot5, tot6, c; \ 5803 c = hc[0]; \ 5804 tot0 = decode[0]*c; \ 5805 tot1 = decode[1]*c; \ 5806 tot2 = decode[2]*c; \ 5807 tot3 = decode[3]*c; \ 5808 tot4 = decode[4]*c; \ 5809 tot5 = decode[5]*c; \ 5810 tot6 = decode[6]*c; 5811 5812 #define stbir__2_coeff_only() \ 5813 float tot0, tot1, tot2, tot3, tot4, tot5, tot6, c; \ 5814 c = hc[0]; \ 5815 tot0 = decode[0]*c; \ 5816 tot1 = decode[1]*c; \ 5817 tot2 = decode[2]*c; \ 5818 tot3 = decode[3]*c; \ 5819 tot4 = decode[4]*c; \ 5820 tot5 = decode[5]*c; \ 5821 tot6 = decode[6]*c; \ 5822 c = hc[1]; \ 5823 tot0 += decode[7]*c; \ 5824 tot1 += decode[8]*c; \ 5825 tot2 += decode[9]*c; \ 5826 tot3 += decode[10]*c; \ 5827 tot4 += decode[11]*c; \ 5828 tot5 += decode[12]*c; \ 5829 tot6 += decode[13]*c; \ 5830 5831 #define stbir__3_coeff_only() \ 5832 float tot0, tot1, tot2, tot3, tot4, tot5, tot6, c; \ 5833 c = hc[0]; \ 5834 tot0 = decode[0]*c; \ 5835 tot1 = decode[1]*c; \ 5836 tot2 = decode[2]*c; \ 5837 tot3 = decode[3]*c; \ 5838 tot4 = decode[4]*c; \ 5839 tot5 = decode[5]*c; \ 5840 tot6 = decode[6]*c; \ 5841 c = hc[1]; \ 5842 tot0 += decode[7]*c; \ 5843 tot1 += decode[8]*c; \ 5844 tot2 += decode[9]*c; \ 5845 tot3 += decode[10]*c; \ 5846 tot4 += decode[11]*c; \ 5847 tot5 += decode[12]*c; \ 5848 tot6 += decode[13]*c; \ 5849 c = hc[2]; \ 5850 tot0 += decode[14]*c; \ 5851 tot1 += decode[15]*c; \ 5852 tot2 += decode[16]*c; \ 5853 tot3 += decode[17]*c; \ 5854 tot4 += decode[18]*c; \ 5855 tot5 += decode[19]*c; \ 5856 tot6 += decode[20]*c; \ 5857 5858 #define stbir__store_output_tiny() \ 5859 output[0] = tot0; \ 5860 output[1] = tot1; \ 5861 output[2] = tot2; \ 5862 output[3] = tot3; \ 5863 output[4] = tot4; \ 5864 output[5] = tot5; \ 5865 output[6] = tot6; \ 5866 horizontal_coefficients += coefficient_width; \ 5867 ++horizontal_contributors; \ 5868 output += 7; 5869 5870 #define stbir__4_coeff_start() \ 5871 float x0,x1,x2,x3,x4,x5,x6,y0,y1,y2,y3,y4,y5,y6,c; \ 5872 STBIR_SIMD_NO_UNROLL(decode); \ 5873 c = hc[0]; \ 5874 x0 = decode[0] * c; \ 5875 x1 = decode[1] * c; \ 5876 x2 = decode[2] * c; \ 5877 x3 = decode[3] * c; \ 5878 x4 = decode[4] * c; \ 5879 x5 = decode[5] * c; \ 5880 x6 = decode[6] * c; \ 5881 c = hc[1]; \ 5882 y0 = decode[7] * c; \ 5883 y1 = decode[8] * c; \ 5884 y2 = decode[9] * c; \ 5885 y3 = decode[10] * c; \ 5886 y4 = decode[11] * c; \ 5887 y5 = decode[12] * c; \ 5888 y6 = decode[13] * c; \ 5889 c = hc[2]; \ 5890 x0 += decode[14] * c; \ 5891 x1 += decode[15] * c; \ 5892 x2 += decode[16] * c; \ 5893 x3 += decode[17] * c; \ 5894 x4 += decode[18] * c; \ 5895 x5 += decode[19] * c; \ 5896 x6 += decode[20] * c; \ 5897 c = hc[3]; \ 5898 y0 += decode[21] * c; \ 5899 y1 += decode[22] * c; \ 5900 y2 += decode[23] * c; \ 5901 y3 += decode[24] * c; \ 5902 y4 += decode[25] * c; \ 5903 y5 += decode[26] * c; \ 5904 y6 += decode[27] * c; 5905 5906 #define stbir__4_coeff_continue_from_4( ofs ) \ 5907 STBIR_SIMD_NO_UNROLL(decode); \ 5908 c = hc[0+(ofs)]; \ 5909 x0 += decode[0+(ofs)*7] * c; \ 5910 x1 += decode[1+(ofs)*7] * c; \ 5911 x2 += decode[2+(ofs)*7] * c; \ 5912 x3 += decode[3+(ofs)*7] * c; \ 5913 x4 += decode[4+(ofs)*7] * c; \ 5914 x5 += decode[5+(ofs)*7] * c; \ 5915 x6 += decode[6+(ofs)*7] * c; \ 5916 c = hc[1+(ofs)]; \ 5917 y0 += decode[7+(ofs)*7] * c; \ 5918 y1 += decode[8+(ofs)*7] * c; \ 5919 y2 += decode[9+(ofs)*7] * c; \ 5920 y3 += decode[10+(ofs)*7] * c; \ 5921 y4 += decode[11+(ofs)*7] * c; \ 5922 y5 += decode[12+(ofs)*7] * c; \ 5923 y6 += decode[13+(ofs)*7] * c; \ 5924 c = hc[2+(ofs)]; \ 5925 x0 += decode[14+(ofs)*7] * c; \ 5926 x1 += decode[15+(ofs)*7] * c; \ 5927 x2 += decode[16+(ofs)*7] * c; \ 5928 x3 += decode[17+(ofs)*7] * c; \ 5929 x4 += decode[18+(ofs)*7] * c; \ 5930 x5 += decode[19+(ofs)*7] * c; \ 5931 x6 += decode[20+(ofs)*7] * c; \ 5932 c = hc[3+(ofs)]; \ 5933 y0 += decode[21+(ofs)*7] * c; \ 5934 y1 += decode[22+(ofs)*7] * c; \ 5935 y2 += decode[23+(ofs)*7] * c; \ 5936 y3 += decode[24+(ofs)*7] * c; \ 5937 y4 += decode[25+(ofs)*7] * c; \ 5938 y5 += decode[26+(ofs)*7] * c; \ 5939 y6 += decode[27+(ofs)*7] * c; 5940 5941 #define stbir__1_coeff_remnant( ofs ) \ 5942 STBIR_SIMD_NO_UNROLL(decode); \ 5943 c = hc[0+(ofs)]; \ 5944 x0 += decode[0+(ofs)*7] * c; \ 5945 x1 += decode[1+(ofs)*7] * c; \ 5946 x2 += decode[2+(ofs)*7] * c; \ 5947 x3 += decode[3+(ofs)*7] * c; \ 5948 x4 += decode[4+(ofs)*7] * c; \ 5949 x5 += decode[5+(ofs)*7] * c; \ 5950 x6 += decode[6+(ofs)*7] * c; \ 5951 5952 #define stbir__2_coeff_remnant( ofs ) \ 5953 STBIR_SIMD_NO_UNROLL(decode); \ 5954 c = hc[0+(ofs)]; \ 5955 x0 += decode[0+(ofs)*7] * c; \ 5956 x1 += decode[1+(ofs)*7] * c; \ 5957 x2 += decode[2+(ofs)*7] * c; \ 5958 x3 += decode[3+(ofs)*7] * c; \ 5959 x4 += decode[4+(ofs)*7] * c; \ 5960 x5 += decode[5+(ofs)*7] * c; \ 5961 x6 += decode[6+(ofs)*7] * c; \ 5962 c = hc[1+(ofs)]; \ 5963 y0 += decode[7+(ofs)*7] * c; \ 5964 y1 += decode[8+(ofs)*7] * c; \ 5965 y2 += decode[9+(ofs)*7] * c; \ 5966 y3 += decode[10+(ofs)*7] * c; \ 5967 y4 += decode[11+(ofs)*7] * c; \ 5968 y5 += decode[12+(ofs)*7] * c; \ 5969 y6 += decode[13+(ofs)*7] * c; \ 5970 5971 #define stbir__3_coeff_remnant( ofs ) \ 5972 STBIR_SIMD_NO_UNROLL(decode); \ 5973 c = hc[0+(ofs)]; \ 5974 x0 += decode[0+(ofs)*7] * c; \ 5975 x1 += decode[1+(ofs)*7] * c; \ 5976 x2 += decode[2+(ofs)*7] * c; \ 5977 x3 += decode[3+(ofs)*7] * c; \ 5978 x4 += decode[4+(ofs)*7] * c; \ 5979 x5 += decode[5+(ofs)*7] * c; \ 5980 x6 += decode[6+(ofs)*7] * c; \ 5981 c = hc[1+(ofs)]; \ 5982 y0 += decode[7+(ofs)*7] * c; \ 5983 y1 += decode[8+(ofs)*7] * c; \ 5984 y2 += decode[9+(ofs)*7] * c; \ 5985 y3 += decode[10+(ofs)*7] * c; \ 5986 y4 += decode[11+(ofs)*7] * c; \ 5987 y5 += decode[12+(ofs)*7] * c; \ 5988 y6 += decode[13+(ofs)*7] * c; \ 5989 c = hc[2+(ofs)]; \ 5990 x0 += decode[14+(ofs)*7] * c; \ 5991 x1 += decode[15+(ofs)*7] * c; \ 5992 x2 += decode[16+(ofs)*7] * c; \ 5993 x3 += decode[17+(ofs)*7] * c; \ 5994 x4 += decode[18+(ofs)*7] * c; \ 5995 x5 += decode[19+(ofs)*7] * c; \ 5996 x6 += decode[20+(ofs)*7] * c; \ 5997 5998 #define stbir__store_output() \ 5999 output[0] = x0 + y0; \ 6000 output[1] = x1 + y1; \ 6001 output[2] = x2 + y2; \ 6002 output[3] = x3 + y3; \ 6003 output[4] = x4 + y4; \ 6004 output[5] = x5 + y5; \ 6005 output[6] = x6 + y6; \ 6006 horizontal_coefficients += coefficient_width; \ 6007 ++horizontal_contributors; \ 6008 output += 7; 6009 6010 #endif 6011 6012 #define STBIR__horizontal_channels 7 6013 #define STB_IMAGE_RESIZE_DO_HORIZONTALS 6014 #include STBIR__HEADER_FILENAME 6015 6016 6017 // include all of the vertical resamplers (both scatter and gather versions) 6018 6019 #define STBIR__vertical_channels 1 6020 #define STB_IMAGE_RESIZE_DO_VERTICALS 6021 #include STBIR__HEADER_FILENAME 6022 6023 #define STBIR__vertical_channels 1 6024 #define STB_IMAGE_RESIZE_DO_VERTICALS 6025 #define STB_IMAGE_RESIZE_VERTICAL_CONTINUE 6026 #include STBIR__HEADER_FILENAME 6027 6028 #define STBIR__vertical_channels 2 6029 #define STB_IMAGE_RESIZE_DO_VERTICALS 6030 #include STBIR__HEADER_FILENAME 6031 6032 #define STBIR__vertical_channels 2 6033 #define STB_IMAGE_RESIZE_DO_VERTICALS 6034 #define STB_IMAGE_RESIZE_VERTICAL_CONTINUE 6035 #include STBIR__HEADER_FILENAME 6036 6037 #define STBIR__vertical_channels 3 6038 #define STB_IMAGE_RESIZE_DO_VERTICALS 6039 #include STBIR__HEADER_FILENAME 6040 6041 #define STBIR__vertical_channels 3 6042 #define STB_IMAGE_RESIZE_DO_VERTICALS 6043 #define STB_IMAGE_RESIZE_VERTICAL_CONTINUE 6044 #include STBIR__HEADER_FILENAME 6045 6046 #define STBIR__vertical_channels 4 6047 #define STB_IMAGE_RESIZE_DO_VERTICALS 6048 #include STBIR__HEADER_FILENAME 6049 6050 #define STBIR__vertical_channels 4 6051 #define STB_IMAGE_RESIZE_DO_VERTICALS 6052 #define STB_IMAGE_RESIZE_VERTICAL_CONTINUE 6053 #include STBIR__HEADER_FILENAME 6054 6055 #define STBIR__vertical_channels 5 6056 #define STB_IMAGE_RESIZE_DO_VERTICALS 6057 #include STBIR__HEADER_FILENAME 6058 6059 #define STBIR__vertical_channels 5 6060 #define STB_IMAGE_RESIZE_DO_VERTICALS 6061 #define STB_IMAGE_RESIZE_VERTICAL_CONTINUE 6062 #include STBIR__HEADER_FILENAME 6063 6064 #define STBIR__vertical_channels 6 6065 #define STB_IMAGE_RESIZE_DO_VERTICALS 6066 #include STBIR__HEADER_FILENAME 6067 6068 #define STBIR__vertical_channels 6 6069 #define STB_IMAGE_RESIZE_DO_VERTICALS 6070 #define STB_IMAGE_RESIZE_VERTICAL_CONTINUE 6071 #include STBIR__HEADER_FILENAME 6072 6073 #define STBIR__vertical_channels 7 6074 #define STB_IMAGE_RESIZE_DO_VERTICALS 6075 #include STBIR__HEADER_FILENAME 6076 6077 #define STBIR__vertical_channels 7 6078 #define STB_IMAGE_RESIZE_DO_VERTICALS 6079 #define STB_IMAGE_RESIZE_VERTICAL_CONTINUE 6080 #include STBIR__HEADER_FILENAME 6081 6082 #define STBIR__vertical_channels 8 6083 #define STB_IMAGE_RESIZE_DO_VERTICALS 6084 #include STBIR__HEADER_FILENAME 6085 6086 #define STBIR__vertical_channels 8 6087 #define STB_IMAGE_RESIZE_DO_VERTICALS 6088 #define STB_IMAGE_RESIZE_VERTICAL_CONTINUE 6089 #include STBIR__HEADER_FILENAME 6090 6091 typedef void STBIR_VERTICAL_GATHERFUNC( float * output, float const * coeffs, float const ** inputs, float const * input0_end ); 6092 6093 static STBIR_VERTICAL_GATHERFUNC * stbir__vertical_gathers[ 8 ] = 6094 { 6095 stbir__vertical_gather_with_1_coeffs,stbir__vertical_gather_with_2_coeffs,stbir__vertical_gather_with_3_coeffs,stbir__vertical_gather_with_4_coeffs,stbir__vertical_gather_with_5_coeffs,stbir__vertical_gather_with_6_coeffs,stbir__vertical_gather_with_7_coeffs,stbir__vertical_gather_with_8_coeffs 6096 }; 6097 6098 static STBIR_VERTICAL_GATHERFUNC * stbir__vertical_gathers_continues[ 8 ] = 6099 { 6100 stbir__vertical_gather_with_1_coeffs_cont,stbir__vertical_gather_with_2_coeffs_cont,stbir__vertical_gather_with_3_coeffs_cont,stbir__vertical_gather_with_4_coeffs_cont,stbir__vertical_gather_with_5_coeffs_cont,stbir__vertical_gather_with_6_coeffs_cont,stbir__vertical_gather_with_7_coeffs_cont,stbir__vertical_gather_with_8_coeffs_cont 6101 }; 6102 6103 typedef void STBIR_VERTICAL_SCATTERFUNC( float ** outputs, float const * coeffs, float const * input, float const * input_end ); 6104 6105 static STBIR_VERTICAL_SCATTERFUNC * stbir__vertical_scatter_sets[ 8 ] = 6106 { 6107 stbir__vertical_scatter_with_1_coeffs,stbir__vertical_scatter_with_2_coeffs,stbir__vertical_scatter_with_3_coeffs,stbir__vertical_scatter_with_4_coeffs,stbir__vertical_scatter_with_5_coeffs,stbir__vertical_scatter_with_6_coeffs,stbir__vertical_scatter_with_7_coeffs,stbir__vertical_scatter_with_8_coeffs 6108 }; 6109 6110 static STBIR_VERTICAL_SCATTERFUNC * stbir__vertical_scatter_blends[ 8 ] = 6111 { 6112 stbir__vertical_scatter_with_1_coeffs_cont,stbir__vertical_scatter_with_2_coeffs_cont,stbir__vertical_scatter_with_3_coeffs_cont,stbir__vertical_scatter_with_4_coeffs_cont,stbir__vertical_scatter_with_5_coeffs_cont,stbir__vertical_scatter_with_6_coeffs_cont,stbir__vertical_scatter_with_7_coeffs_cont,stbir__vertical_scatter_with_8_coeffs_cont 6113 }; 6114 6115 6116 static void stbir__encode_scanline( stbir__info const * stbir_info, void *output_buffer_data, float * encode_buffer, int row STBIR_ONLY_PROFILE_GET_SPLIT_INFO ) 6117 { 6118 int num_pixels = stbir_info->horizontal.scale_info.output_sub_size; 6119 int channels = stbir_info->channels; 6120 int width_times_channels = num_pixels * channels; 6121 void * output_buffer; 6122 6123 // un-alpha weight if we need to 6124 if ( stbir_info->alpha_unweight ) 6125 { 6126 STBIR_PROFILE_START( unalpha ); 6127 stbir_info->alpha_unweight( encode_buffer, width_times_channels ); 6128 STBIR_PROFILE_END( unalpha ); 6129 } 6130 6131 // write directly into output by default 6132 output_buffer = output_buffer_data; 6133 6134 // if we have an output callback, we first convert the decode buffer in place (and then hand that to the callback) 6135 if ( stbir_info->out_pixels_cb ) 6136 output_buffer = encode_buffer; 6137 6138 STBIR_PROFILE_START( encode ); 6139 // convert into the output buffer 6140 stbir_info->encode_pixels( output_buffer, width_times_channels, encode_buffer ); 6141 STBIR_PROFILE_END( encode ); 6142 6143 // if we have an output callback, call it to send the data 6144 if ( stbir_info->out_pixels_cb ) 6145 stbir_info->out_pixels_cb( output_buffer, num_pixels, row, stbir_info->user_data ); 6146 } 6147 6148 6149 // Get the ring buffer pointer for an index 6150 static float* stbir__get_ring_buffer_entry(stbir__info const * stbir_info, stbir__per_split_info const * split_info, int index ) 6151 { 6152 STBIR_ASSERT( index < stbir_info->ring_buffer_num_entries ); 6153 6154 #ifdef STBIR__SEPARATE_ALLOCATIONS 6155 return split_info->ring_buffers[ index ]; 6156 #else 6157 return (float*) ( ( (char*) split_info->ring_buffer ) + ( index * stbir_info->ring_buffer_length_bytes ) ); 6158 #endif 6159 } 6160 6161 // Get the specified scan line from the ring buffer 6162 static float* stbir__get_ring_buffer_scanline(stbir__info const * stbir_info, stbir__per_split_info const * split_info, int get_scanline) 6163 { 6164 int ring_buffer_index = (split_info->ring_buffer_begin_index + (get_scanline - split_info->ring_buffer_first_scanline)) % stbir_info->ring_buffer_num_entries; 6165 return stbir__get_ring_buffer_entry( stbir_info, split_info, ring_buffer_index ); 6166 } 6167 6168 static void stbir__resample_horizontal_gather(stbir__info const * stbir_info, float* output_buffer, float const * input_buffer STBIR_ONLY_PROFILE_GET_SPLIT_INFO ) 6169 { 6170 float const * decode_buffer = input_buffer - ( stbir_info->scanline_extents.conservative.n0 * stbir_info->effective_channels ); 6171 6172 STBIR_PROFILE_START( horizontal ); 6173 if ( ( stbir_info->horizontal.filter_enum == STBIR_FILTER_POINT_SAMPLE ) && ( stbir_info->horizontal.scale_info.scale == 1.0f ) ) 6174 STBIR_MEMCPY( output_buffer, input_buffer, stbir_info->horizontal.scale_info.output_sub_size * sizeof( float ) * stbir_info->effective_channels ); 6175 else 6176 stbir_info->horizontal_gather_channels( output_buffer, stbir_info->horizontal.scale_info.output_sub_size, decode_buffer, stbir_info->horizontal.contributors, stbir_info->horizontal.coefficients, stbir_info->horizontal.coefficient_width ); 6177 STBIR_PROFILE_END( horizontal ); 6178 } 6179 6180 static void stbir__resample_vertical_gather(stbir__info const * stbir_info, stbir__per_split_info* split_info, int n, int contrib_n0, int contrib_n1, float const * vertical_coefficients ) 6181 { 6182 float* encode_buffer = split_info->vertical_buffer; 6183 float* decode_buffer = split_info->decode_buffer; 6184 int vertical_first = stbir_info->vertical_first; 6185 int width = (vertical_first) ? ( stbir_info->scanline_extents.conservative.n1-stbir_info->scanline_extents.conservative.n0+1 ) : stbir_info->horizontal.scale_info.output_sub_size; 6186 int width_times_channels = stbir_info->effective_channels * width; 6187 6188 STBIR_ASSERT( stbir_info->vertical.is_gather ); 6189 6190 // loop over the contributing scanlines and scale into the buffer 6191 STBIR_PROFILE_START( vertical ); 6192 { 6193 int k = 0, total = contrib_n1 - contrib_n0 + 1; 6194 STBIR_ASSERT( total > 0 ); 6195 do { 6196 float const * inputs[8]; 6197 int i, cnt = total; if ( cnt > 8 ) cnt = 8; 6198 for( i = 0 ; i < cnt ; i++ ) 6199 inputs[ i ] = stbir__get_ring_buffer_scanline(stbir_info, split_info, k+i+contrib_n0 ); 6200 6201 // call the N scanlines at a time function (up to 8 scanlines of blending at once) 6202 ((k==0)?stbir__vertical_gathers:stbir__vertical_gathers_continues)[cnt-1]( (vertical_first) ? decode_buffer : encode_buffer, vertical_coefficients + k, inputs, inputs[0] + width_times_channels ); 6203 k += cnt; 6204 total -= cnt; 6205 } while ( total ); 6206 } 6207 STBIR_PROFILE_END( vertical ); 6208 6209 if ( vertical_first ) 6210 { 6211 // Now resample the gathered vertical data in the horizontal axis into the encode buffer 6212 stbir__resample_horizontal_gather(stbir_info, encode_buffer, decode_buffer STBIR_ONLY_PROFILE_SET_SPLIT_INFO ); 6213 } 6214 6215 stbir__encode_scanline( stbir_info, ( (char *) stbir_info->output_data ) + ((size_t)n * (size_t)stbir_info->output_stride_bytes), 6216 encode_buffer, n STBIR_ONLY_PROFILE_SET_SPLIT_INFO ); 6217 } 6218 6219 static void stbir__decode_and_resample_for_vertical_gather_loop(stbir__info const * stbir_info, stbir__per_split_info* split_info, int n) 6220 { 6221 int ring_buffer_index; 6222 float* ring_buffer; 6223 6224 // Decode the nth scanline from the source image into the decode buffer. 6225 stbir__decode_scanline( stbir_info, n, split_info->decode_buffer STBIR_ONLY_PROFILE_SET_SPLIT_INFO ); 6226 6227 // update new end scanline 6228 split_info->ring_buffer_last_scanline = n; 6229 6230 // get ring buffer 6231 ring_buffer_index = (split_info->ring_buffer_begin_index + (split_info->ring_buffer_last_scanline - split_info->ring_buffer_first_scanline)) % stbir_info->ring_buffer_num_entries; 6232 ring_buffer = stbir__get_ring_buffer_entry(stbir_info, split_info, ring_buffer_index); 6233 6234 // Now resample it into the ring buffer. 6235 stbir__resample_horizontal_gather( stbir_info, ring_buffer, split_info->decode_buffer STBIR_ONLY_PROFILE_SET_SPLIT_INFO ); 6236 6237 // Now it's sitting in the ring buffer ready to be used as source for the vertical sampling. 6238 } 6239 6240 static void stbir__vertical_gather_loop( stbir__info const * stbir_info, stbir__per_split_info* split_info, int split_count ) 6241 { 6242 int y, start_output_y, end_output_y; 6243 stbir__contributors* vertical_contributors = stbir_info->vertical.contributors; 6244 float const * vertical_coefficients = stbir_info->vertical.coefficients; 6245 6246 STBIR_ASSERT( stbir_info->vertical.is_gather ); 6247 6248 start_output_y = split_info->start_output_y; 6249 end_output_y = split_info[split_count-1].end_output_y; 6250 6251 vertical_contributors += start_output_y; 6252 vertical_coefficients += start_output_y * stbir_info->vertical.coefficient_width; 6253 6254 // initialize the ring buffer for gathering 6255 split_info->ring_buffer_begin_index = 0; 6256 split_info->ring_buffer_first_scanline = vertical_contributors->n0; 6257 split_info->ring_buffer_last_scanline = split_info->ring_buffer_first_scanline - 1; // means "empty" 6258 6259 for (y = start_output_y; y < end_output_y; y++) 6260 { 6261 int in_first_scanline, in_last_scanline; 6262 6263 in_first_scanline = vertical_contributors->n0; 6264 in_last_scanline = vertical_contributors->n1; 6265 6266 // make sure the indexing hasn't broken 6267 STBIR_ASSERT( in_first_scanline >= split_info->ring_buffer_first_scanline ); 6268 6269 // Load in new scanlines 6270 while (in_last_scanline > split_info->ring_buffer_last_scanline) 6271 { 6272 STBIR_ASSERT( ( split_info->ring_buffer_last_scanline - split_info->ring_buffer_first_scanline + 1 ) <= stbir_info->ring_buffer_num_entries ); 6273 6274 // make sure there was room in the ring buffer when we add new scanlines 6275 if ( ( split_info->ring_buffer_last_scanline - split_info->ring_buffer_first_scanline + 1 ) == stbir_info->ring_buffer_num_entries ) 6276 { 6277 split_info->ring_buffer_first_scanline++; 6278 split_info->ring_buffer_begin_index++; 6279 } 6280 6281 if ( stbir_info->vertical_first ) 6282 { 6283 float * ring_buffer = stbir__get_ring_buffer_scanline( stbir_info, split_info, ++split_info->ring_buffer_last_scanline ); 6284 // Decode the nth scanline from the source image into the decode buffer. 6285 stbir__decode_scanline( stbir_info, split_info->ring_buffer_last_scanline, ring_buffer STBIR_ONLY_PROFILE_SET_SPLIT_INFO ); 6286 } 6287 else 6288 { 6289 stbir__decode_and_resample_for_vertical_gather_loop(stbir_info, split_info, split_info->ring_buffer_last_scanline + 1); 6290 } 6291 } 6292 6293 // Now all buffers should be ready to write a row of vertical sampling, so do it. 6294 stbir__resample_vertical_gather(stbir_info, split_info, y, in_first_scanline, in_last_scanline, vertical_coefficients ); 6295 6296 ++vertical_contributors; 6297 vertical_coefficients += stbir_info->vertical.coefficient_width; 6298 } 6299 } 6300 6301 #define STBIR__FLOAT_EMPTY_MARKER 3.0e+38F 6302 #define STBIR__FLOAT_BUFFER_IS_EMPTY(ptr) ((ptr)[0]==STBIR__FLOAT_EMPTY_MARKER) 6303 6304 static void stbir__encode_first_scanline_from_scatter(stbir__info const * stbir_info, stbir__per_split_info* split_info) 6305 { 6306 // evict a scanline out into the output buffer 6307 float* ring_buffer_entry = stbir__get_ring_buffer_entry(stbir_info, split_info, split_info->ring_buffer_begin_index ); 6308 6309 // dump the scanline out 6310 stbir__encode_scanline( stbir_info, ( (char *)stbir_info->output_data ) + ( (size_t)split_info->ring_buffer_first_scanline * (size_t)stbir_info->output_stride_bytes ), ring_buffer_entry, split_info->ring_buffer_first_scanline STBIR_ONLY_PROFILE_SET_SPLIT_INFO ); 6311 6312 // mark it as empty 6313 ring_buffer_entry[ 0 ] = STBIR__FLOAT_EMPTY_MARKER; 6314 6315 // advance the first scanline 6316 split_info->ring_buffer_first_scanline++; 6317 if ( ++split_info->ring_buffer_begin_index == stbir_info->ring_buffer_num_entries ) 6318 split_info->ring_buffer_begin_index = 0; 6319 } 6320 6321 static void stbir__horizontal_resample_and_encode_first_scanline_from_scatter(stbir__info const * stbir_info, stbir__per_split_info* split_info) 6322 { 6323 // evict a scanline out into the output buffer 6324 6325 float* ring_buffer_entry = stbir__get_ring_buffer_entry(stbir_info, split_info, split_info->ring_buffer_begin_index ); 6326 6327 // Now resample it into the buffer. 6328 stbir__resample_horizontal_gather( stbir_info, split_info->vertical_buffer, ring_buffer_entry STBIR_ONLY_PROFILE_SET_SPLIT_INFO ); 6329 6330 // dump the scanline out 6331 stbir__encode_scanline( stbir_info, ( (char *)stbir_info->output_data ) + ( (size_t)split_info->ring_buffer_first_scanline * (size_t)stbir_info->output_stride_bytes ), split_info->vertical_buffer, split_info->ring_buffer_first_scanline STBIR_ONLY_PROFILE_SET_SPLIT_INFO ); 6332 6333 // mark it as empty 6334 ring_buffer_entry[ 0 ] = STBIR__FLOAT_EMPTY_MARKER; 6335 6336 // advance the first scanline 6337 split_info->ring_buffer_first_scanline++; 6338 if ( ++split_info->ring_buffer_begin_index == stbir_info->ring_buffer_num_entries ) 6339 split_info->ring_buffer_begin_index = 0; 6340 } 6341 6342 static void stbir__resample_vertical_scatter(stbir__info const * stbir_info, stbir__per_split_info* split_info, int n0, int n1, float const * vertical_coefficients, float const * vertical_buffer, float const * vertical_buffer_end ) 6343 { 6344 STBIR_ASSERT( !stbir_info->vertical.is_gather ); 6345 6346 STBIR_PROFILE_START( vertical ); 6347 { 6348 int k = 0, total = n1 - n0 + 1; 6349 STBIR_ASSERT( total > 0 ); 6350 do { 6351 float * outputs[8]; 6352 int i, n = total; if ( n > 8 ) n = 8; 6353 for( i = 0 ; i < n ; i++ ) 6354 { 6355 outputs[ i ] = stbir__get_ring_buffer_scanline(stbir_info, split_info, k+i+n0 ); 6356 if ( ( i ) && ( STBIR__FLOAT_BUFFER_IS_EMPTY( outputs[i] ) != STBIR__FLOAT_BUFFER_IS_EMPTY( outputs[0] ) ) ) // make sure runs are of the same type 6357 { 6358 n = i; 6359 break; 6360 } 6361 } 6362 // call the scatter to N scanlines at a time function (up to 8 scanlines of scattering at once) 6363 ((STBIR__FLOAT_BUFFER_IS_EMPTY( outputs[0] ))?stbir__vertical_scatter_sets:stbir__vertical_scatter_blends)[n-1]( outputs, vertical_coefficients + k, vertical_buffer, vertical_buffer_end ); 6364 k += n; 6365 total -= n; 6366 } while ( total ); 6367 } 6368 6369 STBIR_PROFILE_END( vertical ); 6370 } 6371 6372 typedef void stbir__handle_scanline_for_scatter_func(stbir__info const * stbir_info, stbir__per_split_info* split_info); 6373 6374 static void stbir__vertical_scatter_loop( stbir__info const * stbir_info, stbir__per_split_info* split_info, int split_count ) 6375 { 6376 int y, start_output_y, end_output_y, start_input_y, end_input_y; 6377 stbir__contributors* vertical_contributors = stbir_info->vertical.contributors; 6378 float const * vertical_coefficients = stbir_info->vertical.coefficients; 6379 stbir__handle_scanline_for_scatter_func * handle_scanline_for_scatter; 6380 void * scanline_scatter_buffer; 6381 void * scanline_scatter_buffer_end; 6382 int on_first_input_y, last_input_y; 6383 6384 STBIR_ASSERT( !stbir_info->vertical.is_gather ); 6385 6386 start_output_y = split_info->start_output_y; 6387 end_output_y = split_info[split_count-1].end_output_y; // may do multiple split counts 6388 6389 start_input_y = split_info->start_input_y; 6390 end_input_y = split_info[split_count-1].end_input_y; 6391 6392 // adjust for starting offset start_input_y 6393 y = start_input_y + stbir_info->vertical.filter_pixel_margin; 6394 vertical_contributors += y ; 6395 vertical_coefficients += stbir_info->vertical.coefficient_width * y; 6396 6397 if ( stbir_info->vertical_first ) 6398 { 6399 handle_scanline_for_scatter = stbir__horizontal_resample_and_encode_first_scanline_from_scatter; 6400 scanline_scatter_buffer = split_info->decode_buffer; 6401 scanline_scatter_buffer_end = ( (char*) scanline_scatter_buffer ) + sizeof( float ) * stbir_info->effective_channels * (stbir_info->scanline_extents.conservative.n1-stbir_info->scanline_extents.conservative.n0+1); 6402 } 6403 else 6404 { 6405 handle_scanline_for_scatter = stbir__encode_first_scanline_from_scatter; 6406 scanline_scatter_buffer = split_info->vertical_buffer; 6407 scanline_scatter_buffer_end = ( (char*) scanline_scatter_buffer ) + sizeof( float ) * stbir_info->effective_channels * stbir_info->horizontal.scale_info.output_sub_size; 6408 } 6409 6410 // initialize the ring buffer for scattering 6411 split_info->ring_buffer_first_scanline = start_output_y; 6412 split_info->ring_buffer_last_scanline = -1; 6413 split_info->ring_buffer_begin_index = -1; 6414 6415 // mark all the buffers as empty to start 6416 for( y = 0 ; y < stbir_info->ring_buffer_num_entries ; y++ ) 6417 stbir__get_ring_buffer_entry( stbir_info, split_info, y )[0] = STBIR__FLOAT_EMPTY_MARKER; // only used on scatter 6418 6419 // do the loop in input space 6420 on_first_input_y = 1; last_input_y = start_input_y; 6421 for (y = start_input_y ; y < end_input_y; y++) 6422 { 6423 int out_first_scanline, out_last_scanline; 6424 6425 out_first_scanline = vertical_contributors->n0; 6426 out_last_scanline = vertical_contributors->n1; 6427 6428 STBIR_ASSERT(out_last_scanline - out_first_scanline + 1 <= stbir_info->ring_buffer_num_entries); 6429 6430 if ( ( out_last_scanline >= out_first_scanline ) && ( ( ( out_first_scanline >= start_output_y ) && ( out_first_scanline < end_output_y ) ) || ( ( out_last_scanline >= start_output_y ) && ( out_last_scanline < end_output_y ) ) ) ) 6431 { 6432 float const * vc = vertical_coefficients; 6433 6434 // keep track of the range actually seen for the next resize 6435 last_input_y = y; 6436 if ( ( on_first_input_y ) && ( y > start_input_y ) ) 6437 split_info->start_input_y = y; 6438 on_first_input_y = 0; 6439 6440 // clip the region 6441 if ( out_first_scanline < start_output_y ) 6442 { 6443 vc += start_output_y - out_first_scanline; 6444 out_first_scanline = start_output_y; 6445 } 6446 6447 if ( out_last_scanline >= end_output_y ) 6448 out_last_scanline = end_output_y - 1; 6449 6450 // if very first scanline, init the index 6451 if (split_info->ring_buffer_begin_index < 0) 6452 split_info->ring_buffer_begin_index = out_first_scanline - start_output_y; 6453 6454 STBIR_ASSERT( split_info->ring_buffer_begin_index <= out_first_scanline ); 6455 6456 // Decode the nth scanline from the source image into the decode buffer. 6457 stbir__decode_scanline( stbir_info, y, split_info->decode_buffer STBIR_ONLY_PROFILE_SET_SPLIT_INFO ); 6458 6459 // When horizontal first, we resample horizontally into the vertical buffer before we scatter it out 6460 if ( !stbir_info->vertical_first ) 6461 stbir__resample_horizontal_gather( stbir_info, split_info->vertical_buffer, split_info->decode_buffer STBIR_ONLY_PROFILE_SET_SPLIT_INFO ); 6462 6463 // Now it's sitting in the buffer ready to be distributed into the ring buffers. 6464 6465 // evict from the ringbuffer, if we need are full 6466 if ( ( ( split_info->ring_buffer_last_scanline - split_info->ring_buffer_first_scanline + 1 ) == stbir_info->ring_buffer_num_entries ) && 6467 ( out_last_scanline > split_info->ring_buffer_last_scanline ) ) 6468 handle_scanline_for_scatter( stbir_info, split_info ); 6469 6470 // Now the horizontal buffer is ready to write to all ring buffer rows, so do it. 6471 stbir__resample_vertical_scatter(stbir_info, split_info, out_first_scanline, out_last_scanline, vc, (float*)scanline_scatter_buffer, (float*)scanline_scatter_buffer_end ); 6472 6473 // update the end of the buffer 6474 if ( out_last_scanline > split_info->ring_buffer_last_scanline ) 6475 split_info->ring_buffer_last_scanline = out_last_scanline; 6476 } 6477 ++vertical_contributors; 6478 vertical_coefficients += stbir_info->vertical.coefficient_width; 6479 } 6480 6481 // now evict the scanlines that are left over in the ring buffer 6482 while ( split_info->ring_buffer_first_scanline < end_output_y ) 6483 handle_scanline_for_scatter(stbir_info, split_info); 6484 6485 // update the end_input_y if we do multiple resizes with the same data 6486 ++last_input_y; 6487 for( y = 0 ; y < split_count; y++ ) 6488 if ( split_info[y].end_input_y > last_input_y ) 6489 split_info[y].end_input_y = last_input_y; 6490 } 6491 6492 6493 static stbir__kernel_callback * stbir__builtin_kernels[] = { 0, stbir__filter_trapezoid, stbir__filter_triangle, stbir__filter_cubic, stbir__filter_catmullrom, stbir__filter_mitchell, stbir__filter_point }; 6494 static stbir__support_callback * stbir__builtin_supports[] = { 0, stbir__support_trapezoid, stbir__support_one, stbir__support_two, stbir__support_two, stbir__support_two, stbir__support_zeropoint5 }; 6495 6496 static void stbir__set_sampler(stbir__sampler * samp, stbir_filter filter, stbir__kernel_callback * kernel, stbir__support_callback * support, stbir_edge edge, stbir__scale_info * scale_info, int always_gather, void * user_data ) 6497 { 6498 // set filter 6499 if (filter == 0) 6500 { 6501 filter = STBIR_DEFAULT_FILTER_DOWNSAMPLE; // default to downsample 6502 if (scale_info->scale >= ( 1.0f - stbir__small_float ) ) 6503 { 6504 if ( (scale_info->scale <= ( 1.0f + stbir__small_float ) ) && ( STBIR_CEILF(scale_info->pixel_shift) == scale_info->pixel_shift ) ) 6505 filter = STBIR_FILTER_POINT_SAMPLE; 6506 else 6507 filter = STBIR_DEFAULT_FILTER_UPSAMPLE; 6508 } 6509 } 6510 samp->filter_enum = filter; 6511 6512 STBIR_ASSERT(samp->filter_enum != 0); 6513 STBIR_ASSERT((unsigned)samp->filter_enum < STBIR_FILTER_OTHER); 6514 samp->filter_kernel = stbir__builtin_kernels[ filter ]; 6515 samp->filter_support = stbir__builtin_supports[ filter ]; 6516 6517 if ( kernel && support ) 6518 { 6519 samp->filter_kernel = kernel; 6520 samp->filter_support = support; 6521 samp->filter_enum = STBIR_FILTER_OTHER; 6522 } 6523 6524 samp->edge = edge; 6525 samp->filter_pixel_width = stbir__get_filter_pixel_width (samp->filter_support, scale_info->scale, user_data ); 6526 // Gather is always better, but in extreme downsamples, you have to most or all of the data in memory 6527 // For horizontal, we always have all the pixels, so we always use gather here (always_gather==1). 6528 // For vertical, we use gather if scaling up (which means we will have samp->filter_pixel_width 6529 // scanlines in memory at once). 6530 samp->is_gather = 0; 6531 if ( scale_info->scale >= ( 1.0f - stbir__small_float ) ) 6532 samp->is_gather = 1; 6533 else if ( ( always_gather ) || ( samp->filter_pixel_width <= STBIR_FORCE_GATHER_FILTER_SCANLINES_AMOUNT ) ) 6534 samp->is_gather = 2; 6535 6536 // pre calculate stuff based on the above 6537 samp->coefficient_width = stbir__get_coefficient_width(samp, samp->is_gather, user_data); 6538 6539 // filter_pixel_width is the conservative size in pixels of input that affect an output pixel. 6540 // In rare cases (only with 2 pix to 1 pix with the default filters), it's possible that the 6541 // filter will extend before or after the scanline beyond just one extra entire copy of the 6542 // scanline (we would hit the edge twice). We don't let you do that, so we clamp the total 6543 // width to 3x the total of input pixel (once for the scanline, once for the left side 6544 // overhang, and once for the right side). We only do this for edge mode, since the other 6545 // modes can just re-edge clamp back in again. 6546 if ( edge == STBIR_EDGE_WRAP ) 6547 if ( samp->filter_pixel_width > ( scale_info->input_full_size * 3 ) ) 6548 samp->filter_pixel_width = scale_info->input_full_size * 3; 6549 6550 // This is how much to expand buffers to account for filters seeking outside 6551 // the image boundaries. 6552 samp->filter_pixel_margin = samp->filter_pixel_width / 2; 6553 6554 // filter_pixel_margin is the amount that this filter can overhang on just one side of either 6555 // end of the scanline (left or the right). Since we only allow you to overhang 1 scanline's 6556 // worth of pixels, we clamp this one side of overhang to the input scanline size. Again, 6557 // this clamping only happens in rare cases with the default filters (2 pix to 1 pix). 6558 if ( edge == STBIR_EDGE_WRAP ) 6559 if ( samp->filter_pixel_margin > scale_info->input_full_size ) 6560 samp->filter_pixel_margin = scale_info->input_full_size; 6561 6562 samp->num_contributors = stbir__get_contributors(samp, samp->is_gather); 6563 6564 samp->contributors_size = samp->num_contributors * sizeof(stbir__contributors); 6565 samp->coefficients_size = samp->num_contributors * samp->coefficient_width * sizeof(float) + sizeof(float); // extra sizeof(float) is padding 6566 6567 samp->gather_prescatter_contributors = 0; 6568 samp->gather_prescatter_coefficients = 0; 6569 if ( samp->is_gather == 0 ) 6570 { 6571 samp->gather_prescatter_coefficient_width = samp->filter_pixel_width; 6572 samp->gather_prescatter_num_contributors = stbir__get_contributors(samp, 2); 6573 samp->gather_prescatter_contributors_size = samp->gather_prescatter_num_contributors * sizeof(stbir__contributors); 6574 samp->gather_prescatter_coefficients_size = samp->gather_prescatter_num_contributors * samp->gather_prescatter_coefficient_width * sizeof(float); 6575 } 6576 } 6577 6578 static void stbir__get_conservative_extents( stbir__sampler * samp, stbir__contributors * range, void * user_data ) 6579 { 6580 float scale = samp->scale_info.scale; 6581 float out_shift = samp->scale_info.pixel_shift; 6582 stbir__support_callback * support = samp->filter_support; 6583 int input_full_size = samp->scale_info.input_full_size; 6584 stbir_edge edge = samp->edge; 6585 float inv_scale = samp->scale_info.inv_scale; 6586 6587 STBIR_ASSERT( samp->is_gather != 0 ); 6588 6589 if ( samp->is_gather == 1 ) 6590 { 6591 int in_first_pixel, in_last_pixel; 6592 float out_filter_radius = support(inv_scale, user_data) * scale; 6593 6594 stbir__calculate_in_pixel_range( &in_first_pixel, &in_last_pixel, 0.5, out_filter_radius, inv_scale, out_shift, input_full_size, edge ); 6595 range->n0 = in_first_pixel; 6596 stbir__calculate_in_pixel_range( &in_first_pixel, &in_last_pixel, ( (float)(samp->scale_info.output_sub_size-1) ) + 0.5f, out_filter_radius, inv_scale, out_shift, input_full_size, edge ); 6597 range->n1 = in_last_pixel; 6598 } 6599 else if ( samp->is_gather == 2 ) // downsample gather, refine 6600 { 6601 float in_pixels_radius = support(scale, user_data) * inv_scale; 6602 int filter_pixel_margin = samp->filter_pixel_margin; 6603 int output_sub_size = samp->scale_info.output_sub_size; 6604 int input_end; 6605 int n; 6606 int in_first_pixel, in_last_pixel; 6607 6608 // get a conservative area of the input range 6609 stbir__calculate_in_pixel_range( &in_first_pixel, &in_last_pixel, 0, 0, inv_scale, out_shift, input_full_size, edge ); 6610 range->n0 = in_first_pixel; 6611 stbir__calculate_in_pixel_range( &in_first_pixel, &in_last_pixel, (float)output_sub_size, 0, inv_scale, out_shift, input_full_size, edge ); 6612 range->n1 = in_last_pixel; 6613 6614 // now go through the margin to the start of area to find bottom 6615 n = range->n0 + 1; 6616 input_end = -filter_pixel_margin; 6617 while( n >= input_end ) 6618 { 6619 int out_first_pixel, out_last_pixel; 6620 stbir__calculate_out_pixel_range( &out_first_pixel, &out_last_pixel, ((float)n)+0.5f, in_pixels_radius, scale, out_shift, output_sub_size ); 6621 if ( out_first_pixel > out_last_pixel ) 6622 break; 6623 6624 if ( ( out_first_pixel < output_sub_size ) || ( out_last_pixel >= 0 ) ) 6625 range->n0 = n; 6626 --n; 6627 } 6628 6629 // now go through the end of the area through the margin to find top 6630 n = range->n1 - 1; 6631 input_end = n + 1 + filter_pixel_margin; 6632 while( n <= input_end ) 6633 { 6634 int out_first_pixel, out_last_pixel; 6635 stbir__calculate_out_pixel_range( &out_first_pixel, &out_last_pixel, ((float)n)+0.5f, in_pixels_radius, scale, out_shift, output_sub_size ); 6636 if ( out_first_pixel > out_last_pixel ) 6637 break; 6638 if ( ( out_first_pixel < output_sub_size ) || ( out_last_pixel >= 0 ) ) 6639 range->n1 = n; 6640 ++n; 6641 } 6642 } 6643 6644 if ( samp->edge == STBIR_EDGE_WRAP ) 6645 { 6646 // if we are wrapping, and we are very close to the image size (so the edges might merge), just use the scanline up to the edge 6647 if ( ( range->n0 > 0 ) && ( range->n1 >= input_full_size ) ) 6648 { 6649 int marg = range->n1 - input_full_size + 1; 6650 if ( ( marg + STBIR__MERGE_RUNS_PIXEL_THRESHOLD ) >= range->n0 ) 6651 range->n0 = 0; 6652 } 6653 if ( ( range->n0 < 0 ) && ( range->n1 < (input_full_size-1) ) ) 6654 { 6655 int marg = -range->n0; 6656 if ( ( input_full_size - marg - STBIR__MERGE_RUNS_PIXEL_THRESHOLD - 1 ) <= range->n1 ) 6657 range->n1 = input_full_size - 1; 6658 } 6659 } 6660 else 6661 { 6662 // for non-edge-wrap modes, we never read over the edge, so clamp 6663 if ( range->n0 < 0 ) 6664 range->n0 = 0; 6665 if ( range->n1 >= input_full_size ) 6666 range->n1 = input_full_size - 1; 6667 } 6668 } 6669 6670 static void stbir__get_split_info( stbir__per_split_info* split_info, int splits, int output_height, int vertical_pixel_margin, int input_full_height ) 6671 { 6672 int i, cur; 6673 int left = output_height; 6674 6675 cur = 0; 6676 for( i = 0 ; i < splits ; i++ ) 6677 { 6678 int each; 6679 split_info[i].start_output_y = cur; 6680 each = left / ( splits - i ); 6681 split_info[i].end_output_y = cur + each; 6682 cur += each; 6683 left -= each; 6684 6685 // scatter range (updated to minimum as you run it) 6686 split_info[i].start_input_y = -vertical_pixel_margin; 6687 split_info[i].end_input_y = input_full_height + vertical_pixel_margin; 6688 } 6689 } 6690 6691 static void stbir__free_internal_mem( stbir__info *info ) 6692 { 6693 #define STBIR__FREE_AND_CLEAR( ptr ) { if ( ptr ) { void * p = (ptr); (ptr) = 0; STBIR_FREE( p, info->user_data); } } 6694 6695 if ( info ) 6696 { 6697 #ifndef STBIR__SEPARATE_ALLOCATIONS 6698 STBIR__FREE_AND_CLEAR( info->alloced_mem ); 6699 #else 6700 int i,j; 6701 6702 if ( ( info->vertical.gather_prescatter_contributors ) && ( (void*)info->vertical.gather_prescatter_contributors != (void*)info->split_info[0].decode_buffer ) ) 6703 { 6704 STBIR__FREE_AND_CLEAR( info->vertical.gather_prescatter_coefficients ); 6705 STBIR__FREE_AND_CLEAR( info->vertical.gather_prescatter_contributors ); 6706 } 6707 for( i = 0 ; i < info->splits ; i++ ) 6708 { 6709 for( j = 0 ; j < info->alloc_ring_buffer_num_entries ; j++ ) 6710 { 6711 #ifdef STBIR_SIMD8 6712 if ( info->effective_channels == 3 ) 6713 --info->split_info[i].ring_buffers[j]; // avx in 3 channel mode needs one float at the start of the buffer 6714 #endif 6715 STBIR__FREE_AND_CLEAR( info->split_info[i].ring_buffers[j] ); 6716 } 6717 6718 #ifdef STBIR_SIMD8 6719 if ( info->effective_channels == 3 ) 6720 --info->split_info[i].decode_buffer; // avx in 3 channel mode needs one float at the start of the buffer 6721 #endif 6722 STBIR__FREE_AND_CLEAR( info->split_info[i].decode_buffer ); 6723 STBIR__FREE_AND_CLEAR( info->split_info[i].ring_buffers ); 6724 STBIR__FREE_AND_CLEAR( info->split_info[i].vertical_buffer ); 6725 } 6726 STBIR__FREE_AND_CLEAR( info->split_info ); 6727 if ( info->vertical.coefficients != info->horizontal.coefficients ) 6728 { 6729 STBIR__FREE_AND_CLEAR( info->vertical.coefficients ); 6730 STBIR__FREE_AND_CLEAR( info->vertical.contributors ); 6731 } 6732 STBIR__FREE_AND_CLEAR( info->horizontal.coefficients ); 6733 STBIR__FREE_AND_CLEAR( info->horizontal.contributors ); 6734 STBIR__FREE_AND_CLEAR( info->alloced_mem ); 6735 STBIR_FREE( info, info->user_data ); 6736 #endif 6737 } 6738 6739 #undef STBIR__FREE_AND_CLEAR 6740 } 6741 6742 static int stbir__get_max_split( int splits, int height ) 6743 { 6744 int i; 6745 int max = 0; 6746 6747 for( i = 0 ; i < splits ; i++ ) 6748 { 6749 int each = height / ( splits - i ); 6750 if ( each > max ) 6751 max = each; 6752 height -= each; 6753 } 6754 return max; 6755 } 6756 6757 static stbir__horizontal_gather_channels_func ** stbir__horizontal_gather_n_coeffs_funcs[8] = 6758 { 6759 0, stbir__horizontal_gather_1_channels_with_n_coeffs_funcs, stbir__horizontal_gather_2_channels_with_n_coeffs_funcs, stbir__horizontal_gather_3_channels_with_n_coeffs_funcs, stbir__horizontal_gather_4_channels_with_n_coeffs_funcs, 0,0, stbir__horizontal_gather_7_channels_with_n_coeffs_funcs 6760 }; 6761 6762 static stbir__horizontal_gather_channels_func ** stbir__horizontal_gather_channels_funcs[8] = 6763 { 6764 0, stbir__horizontal_gather_1_channels_funcs, stbir__horizontal_gather_2_channels_funcs, stbir__horizontal_gather_3_channels_funcs, stbir__horizontal_gather_4_channels_funcs, 0,0, stbir__horizontal_gather_7_channels_funcs 6765 }; 6766 6767 // there are six resize classifications: 0 == vertical scatter, 1 == vertical gather < 1x scale, 2 == vertical gather 1x-2x scale, 4 == vertical gather < 3x scale, 4 == vertical gather > 3x scale, 5 == <=4 pixel height, 6 == <=4 pixel wide column 6768 #define STBIR_RESIZE_CLASSIFICATIONS 8 6769 6770 static float stbir__compute_weights[5][STBIR_RESIZE_CLASSIFICATIONS][4]= // 5 = 0=1chan, 1=2chan, 2=3chan, 3=4chan, 4=7chan 6771 { 6772 { 6773 { 1.00000f, 1.00000f, 0.31250f, 1.00000f }, 6774 { 0.56250f, 0.59375f, 0.00000f, 0.96875f }, 6775 { 1.00000f, 0.06250f, 0.00000f, 1.00000f }, 6776 { 0.00000f, 0.09375f, 1.00000f, 1.00000f }, 6777 { 1.00000f, 1.00000f, 1.00000f, 1.00000f }, 6778 { 0.03125f, 0.12500f, 1.00000f, 1.00000f }, 6779 { 0.06250f, 0.12500f, 0.00000f, 1.00000f }, 6780 { 0.00000f, 1.00000f, 0.00000f, 0.03125f }, 6781 }, { 6782 { 0.00000f, 0.84375f, 0.00000f, 0.03125f }, 6783 { 0.09375f, 0.93750f, 0.00000f, 0.78125f }, 6784 { 0.87500f, 0.21875f, 0.00000f, 0.96875f }, 6785 { 0.09375f, 0.09375f, 1.00000f, 1.00000f }, 6786 { 1.00000f, 1.00000f, 1.00000f, 1.00000f }, 6787 { 0.03125f, 0.12500f, 1.00000f, 1.00000f }, 6788 { 0.06250f, 0.12500f, 0.00000f, 1.00000f }, 6789 { 0.00000f, 1.00000f, 0.00000f, 0.53125f }, 6790 }, { 6791 { 0.00000f, 0.53125f, 0.00000f, 0.03125f }, 6792 { 0.06250f, 0.96875f, 0.00000f, 0.53125f }, 6793 { 0.87500f, 0.18750f, 0.00000f, 0.93750f }, 6794 { 0.00000f, 0.09375f, 1.00000f, 1.00000f }, 6795 { 1.00000f, 1.00000f, 1.00000f, 1.00000f }, 6796 { 0.03125f, 0.12500f, 1.00000f, 1.00000f }, 6797 { 0.06250f, 0.12500f, 0.00000f, 1.00000f }, 6798 { 0.00000f, 1.00000f, 0.00000f, 0.56250f }, 6799 }, { 6800 { 0.00000f, 0.50000f, 0.00000f, 0.71875f }, 6801 { 0.06250f, 0.84375f, 0.00000f, 0.87500f }, 6802 { 1.00000f, 0.50000f, 0.50000f, 0.96875f }, 6803 { 1.00000f, 0.09375f, 0.31250f, 0.50000f }, 6804 { 1.00000f, 1.00000f, 1.00000f, 1.00000f }, 6805 { 1.00000f, 0.03125f, 0.03125f, 0.53125f }, 6806 { 0.18750f, 0.12500f, 0.00000f, 1.00000f }, 6807 { 0.00000f, 1.00000f, 0.03125f, 0.18750f }, 6808 }, { 6809 { 0.00000f, 0.59375f, 0.00000f, 0.96875f }, 6810 { 0.06250f, 0.81250f, 0.06250f, 0.59375f }, 6811 { 0.75000f, 0.43750f, 0.12500f, 0.96875f }, 6812 { 0.87500f, 0.06250f, 0.18750f, 0.43750f }, 6813 { 1.00000f, 1.00000f, 1.00000f, 1.00000f }, 6814 { 0.15625f, 0.12500f, 1.00000f, 1.00000f }, 6815 { 0.06250f, 0.12500f, 0.00000f, 1.00000f }, 6816 { 0.00000f, 1.00000f, 0.03125f, 0.34375f }, 6817 } 6818 }; 6819 6820 // structure that allow us to query and override info for training the costs 6821 typedef struct STBIR__V_FIRST_INFO 6822 { 6823 double v_cost, h_cost; 6824 int control_v_first; // 0 = no control, 1 = force hori, 2 = force vert 6825 int v_first; 6826 int v_resize_classification; 6827 int is_gather; 6828 } STBIR__V_FIRST_INFO; 6829 6830 #ifdef STBIR__V_FIRST_INFO_BUFFER 6831 static STBIR__V_FIRST_INFO STBIR__V_FIRST_INFO_BUFFER = {0}; 6832 #define STBIR__V_FIRST_INFO_POINTER &STBIR__V_FIRST_INFO_BUFFER 6833 #else 6834 #define STBIR__V_FIRST_INFO_POINTER 0 6835 #endif 6836 6837 // Figure out whether to scale along the horizontal or vertical first. 6838 // This only *super* important when you are scaling by a massively 6839 // different amount in the vertical vs the horizontal (for example, if 6840 // you are scaling by 2x in the width, and 0.5x in the height, then you 6841 // want to do the vertical scale first, because it's around 3x faster 6842 // in that order. 6843 // 6844 // In more normal circumstances, this makes a 20-40% differences, so 6845 // it's good to get right, but not critical. The normal way that you 6846 // decide which direction goes first is just figuring out which 6847 // direction does more multiplies. But with modern CPUs with their 6848 // fancy caches and SIMD and high IPC abilities, so there's just a lot 6849 // more that goes into it. 6850 // 6851 // My handwavy sort of solution is to have an app that does a whole 6852 // bunch of timing for both vertical and horizontal first modes, 6853 // and then another app that can read lots of these timing files 6854 // and try to search for the best weights to use. Dotimings.c 6855 // is the app that does a bunch of timings, and vf_train.c is the 6856 // app that solves for the best weights (and shows how well it 6857 // does currently). 6858 6859 static int stbir__should_do_vertical_first( float weights_table[STBIR_RESIZE_CLASSIFICATIONS][4], int horizontal_filter_pixel_width, float horizontal_scale, int horizontal_output_size, int vertical_filter_pixel_width, float vertical_scale, int vertical_output_size, int is_gather, STBIR__V_FIRST_INFO * info ) 6860 { 6861 double v_cost, h_cost; 6862 float * weights; 6863 int vertical_first; 6864 int v_classification; 6865 6866 // categorize the resize into buckets 6867 if ( ( vertical_output_size <= 4 ) || ( horizontal_output_size <= 4 ) ) 6868 v_classification = ( vertical_output_size < horizontal_output_size ) ? 6 : 7; 6869 else if ( vertical_scale <= 1.0f ) 6870 v_classification = ( is_gather ) ? 1 : 0; 6871 else if ( vertical_scale <= 2.0f) 6872 v_classification = 2; 6873 else if ( vertical_scale <= 3.0f) 6874 v_classification = 3; 6875 else if ( vertical_scale <= 4.0f) 6876 v_classification = 5; 6877 else 6878 v_classification = 6; 6879 6880 // use the right weights 6881 weights = weights_table[ v_classification ]; 6882 6883 // this is the costs when you don't take into account modern CPUs with high ipc and simd and caches - wish we had a better estimate 6884 h_cost = (float)horizontal_filter_pixel_width * weights[0] + horizontal_scale * (float)vertical_filter_pixel_width * weights[1]; 6885 v_cost = (float)vertical_filter_pixel_width * weights[2] + vertical_scale * (float)horizontal_filter_pixel_width * weights[3]; 6886 6887 // use computation estimate to decide vertical first or not 6888 vertical_first = ( v_cost <= h_cost ) ? 1 : 0; 6889 6890 // save these, if requested 6891 if ( info ) 6892 { 6893 info->h_cost = h_cost; 6894 info->v_cost = v_cost; 6895 info->v_resize_classification = v_classification; 6896 info->v_first = vertical_first; 6897 info->is_gather = is_gather; 6898 } 6899 6900 // and this allows us to override everything for testing (see dotiming.c) 6901 if ( ( info ) && ( info->control_v_first ) ) 6902 vertical_first = ( info->control_v_first == 2 ) ? 1 : 0; 6903 6904 return vertical_first; 6905 } 6906 6907 // layout lookups - must match stbir_internal_pixel_layout 6908 static unsigned char stbir__pixel_channels[] = { 6909 1,2,3,3,4, // 1ch, 2ch, rgb, bgr, 4ch 6910 4,4,4,4,2,2, // RGBA,BGRA,ARGB,ABGR,RA,AR 6911 4,4,4,4,2,2, // RGBA_PM,BGRA_PM,ARGB_PM,ABGR_PM,RA_PM,AR_PM 6912 }; 6913 6914 // the internal pixel layout enums are in a different order, so we can easily do range comparisons of types 6915 // the public pixel layout is ordered in a way that if you cast num_channels (1-4) to the enum, you get something sensible 6916 static stbir_internal_pixel_layout stbir__pixel_layout_convert_public_to_internal[] = { 6917 STBIRI_BGR, STBIRI_1CHANNEL, STBIRI_2CHANNEL, STBIRI_RGB, STBIRI_RGBA, 6918 STBIRI_4CHANNEL, STBIRI_BGRA, STBIRI_ARGB, STBIRI_ABGR, STBIRI_RA, STBIRI_AR, 6919 STBIRI_RGBA_PM, STBIRI_BGRA_PM, STBIRI_ARGB_PM, STBIRI_ABGR_PM, STBIRI_RA_PM, STBIRI_AR_PM, 6920 }; 6921 6922 static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sampler * horizontal, stbir__sampler * vertical, stbir__contributors * conservative, stbir_pixel_layout input_pixel_layout_public, stbir_pixel_layout output_pixel_layout_public, int splits, int new_x, int new_y, int fast_alpha, void * user_data STBIR_ONLY_PROFILE_BUILD_GET_INFO ) 6923 { 6924 static char stbir_channel_count_index[8]={ 9,0,1,2, 3,9,9,4 }; 6925 6926 stbir__info * info = 0; 6927 void * alloced = 0; 6928 size_t alloced_total = 0; 6929 int vertical_first; 6930 int decode_buffer_size, ring_buffer_length_bytes, ring_buffer_size, vertical_buffer_size, alloc_ring_buffer_num_entries; 6931 6932 int alpha_weighting_type = 0; // 0=none, 1=simple, 2=fancy 6933 int conservative_split_output_size = stbir__get_max_split( splits, vertical->scale_info.output_sub_size ); 6934 stbir_internal_pixel_layout input_pixel_layout = stbir__pixel_layout_convert_public_to_internal[ input_pixel_layout_public ]; 6935 stbir_internal_pixel_layout output_pixel_layout = stbir__pixel_layout_convert_public_to_internal[ output_pixel_layout_public ]; 6936 int channels = stbir__pixel_channels[ input_pixel_layout ]; 6937 int effective_channels = channels; 6938 6939 // first figure out what type of alpha weighting to use (if any) 6940 if ( ( horizontal->filter_enum != STBIR_FILTER_POINT_SAMPLE ) || ( vertical->filter_enum != STBIR_FILTER_POINT_SAMPLE ) ) // no alpha weighting on point sampling 6941 { 6942 if ( ( input_pixel_layout >= STBIRI_RGBA ) && ( input_pixel_layout <= STBIRI_AR ) && ( output_pixel_layout >= STBIRI_RGBA ) && ( output_pixel_layout <= STBIRI_AR ) ) 6943 { 6944 if ( fast_alpha ) 6945 { 6946 alpha_weighting_type = 4; 6947 } 6948 else 6949 { 6950 static int fancy_alpha_effective_cnts[6] = { 7, 7, 7, 7, 3, 3 }; 6951 alpha_weighting_type = 2; 6952 effective_channels = fancy_alpha_effective_cnts[ input_pixel_layout - STBIRI_RGBA ]; 6953 } 6954 } 6955 else if ( ( input_pixel_layout >= STBIRI_RGBA_PM ) && ( input_pixel_layout <= STBIRI_AR_PM ) && ( output_pixel_layout >= STBIRI_RGBA ) && ( output_pixel_layout <= STBIRI_AR ) ) 6956 { 6957 // input premult, output non-premult 6958 alpha_weighting_type = 3; 6959 } 6960 else if ( ( input_pixel_layout >= STBIRI_RGBA ) && ( input_pixel_layout <= STBIRI_AR ) && ( output_pixel_layout >= STBIRI_RGBA_PM ) && ( output_pixel_layout <= STBIRI_AR_PM ) ) 6961 { 6962 // input non-premult, output premult 6963 alpha_weighting_type = 1; 6964 } 6965 } 6966 6967 // channel in and out count must match currently 6968 if ( channels != stbir__pixel_channels[ output_pixel_layout ] ) 6969 return 0; 6970 6971 // get vertical first 6972 vertical_first = stbir__should_do_vertical_first( stbir__compute_weights[ (int)stbir_channel_count_index[ effective_channels ] ], horizontal->filter_pixel_width, horizontal->scale_info.scale, horizontal->scale_info.output_sub_size, vertical->filter_pixel_width, vertical->scale_info.scale, vertical->scale_info.output_sub_size, vertical->is_gather, STBIR__V_FIRST_INFO_POINTER ); 6973 6974 // sometimes read one float off in some of the unrolled loops (with a weight of zero coeff, so it doesn't have an effect) 6975 decode_buffer_size = ( conservative->n1 - conservative->n0 + 1 ) * effective_channels * sizeof(float) + sizeof(float); // extra float for padding 6976 6977 #if defined( STBIR__SEPARATE_ALLOCATIONS ) && defined(STBIR_SIMD8) 6978 if ( effective_channels == 3 ) 6979 decode_buffer_size += sizeof(float); // avx in 3 channel mode needs one float at the start of the buffer (only with separate allocations) 6980 #endif 6981 6982 ring_buffer_length_bytes = horizontal->scale_info.output_sub_size * effective_channels * sizeof(float) + sizeof(float); // extra float for padding 6983 6984 // if we do vertical first, the ring buffer holds a whole decoded line 6985 if ( vertical_first ) 6986 ring_buffer_length_bytes = ( decode_buffer_size + 15 ) & ~15; 6987 6988 if ( ( ring_buffer_length_bytes & 4095 ) == 0 ) ring_buffer_length_bytes += 64*3; // avoid 4k alias 6989 6990 // One extra entry because floating point precision problems sometimes cause an extra to be necessary. 6991 alloc_ring_buffer_num_entries = vertical->filter_pixel_width + 1; 6992 6993 // we never need more ring buffer entries than the scanlines we're outputting when in scatter mode 6994 if ( ( !vertical->is_gather ) && ( alloc_ring_buffer_num_entries > conservative_split_output_size ) ) 6995 alloc_ring_buffer_num_entries = conservative_split_output_size; 6996 6997 ring_buffer_size = alloc_ring_buffer_num_entries * ring_buffer_length_bytes; 6998 6999 // The vertical buffer is used differently, depending on whether we are scattering 7000 // the vertical scanlines, or gathering them. 7001 // If scattering, it's used at the temp buffer to accumulate each output. 7002 // If gathering, it's just the output buffer. 7003 vertical_buffer_size = horizontal->scale_info.output_sub_size * effective_channels * sizeof(float) + sizeof(float); // extra float for padding 7004 7005 // we make two passes through this loop, 1st to add everything up, 2nd to allocate and init 7006 for(;;) 7007 { 7008 int i; 7009 void * advance_mem = alloced; 7010 int copy_horizontal = 0; 7011 stbir__sampler * possibly_use_horizontal_for_pivot = 0; 7012 7013 #ifdef STBIR__SEPARATE_ALLOCATIONS 7014 #define STBIR__NEXT_PTR( ptr, size, ntype ) if ( alloced ) { void * p = STBIR_MALLOC( size, user_data); if ( p == 0 ) { stbir__free_internal_mem( info ); return 0; } (ptr) = (ntype*)p; } 7015 #else 7016 #define STBIR__NEXT_PTR( ptr, size, ntype ) advance_mem = (void*) ( ( ((size_t)advance_mem) + 15 ) & ~15 ); if ( alloced ) ptr = (ntype*)advance_mem; advance_mem = ((char*)advance_mem) + (size); 7017 #endif 7018 7019 STBIR__NEXT_PTR( info, sizeof( stbir__info ), stbir__info ); 7020 7021 STBIR__NEXT_PTR( info->split_info, sizeof( stbir__per_split_info ) * splits, stbir__per_split_info ); 7022 7023 if ( info ) 7024 { 7025 static stbir__alpha_weight_func * fancy_alpha_weights[6] = { stbir__fancy_alpha_weight_4ch, stbir__fancy_alpha_weight_4ch, stbir__fancy_alpha_weight_4ch, stbir__fancy_alpha_weight_4ch, stbir__fancy_alpha_weight_2ch, stbir__fancy_alpha_weight_2ch }; 7026 static stbir__alpha_unweight_func * fancy_alpha_unweights[6] = { stbir__fancy_alpha_unweight_4ch, stbir__fancy_alpha_unweight_4ch, stbir__fancy_alpha_unweight_4ch, stbir__fancy_alpha_unweight_4ch, stbir__fancy_alpha_unweight_2ch, stbir__fancy_alpha_unweight_2ch }; 7027 static stbir__alpha_weight_func * simple_alpha_weights[6] = { stbir__simple_alpha_weight_4ch, stbir__simple_alpha_weight_4ch, stbir__simple_alpha_weight_4ch, stbir__simple_alpha_weight_4ch, stbir__simple_alpha_weight_2ch, stbir__simple_alpha_weight_2ch }; 7028 static stbir__alpha_unweight_func * simple_alpha_unweights[6] = { stbir__simple_alpha_unweight_4ch, stbir__simple_alpha_unweight_4ch, stbir__simple_alpha_unweight_4ch, stbir__simple_alpha_unweight_4ch, stbir__simple_alpha_unweight_2ch, stbir__simple_alpha_unweight_2ch }; 7029 7030 // initialize info fields 7031 info->alloced_mem = alloced; 7032 info->alloced_total = alloced_total; 7033 7034 info->channels = channels; 7035 info->effective_channels = effective_channels; 7036 7037 info->offset_x = new_x; 7038 info->offset_y = new_y; 7039 info->alloc_ring_buffer_num_entries = alloc_ring_buffer_num_entries; 7040 info->ring_buffer_num_entries = 0; 7041 info->ring_buffer_length_bytes = ring_buffer_length_bytes; 7042 info->splits = splits; 7043 info->vertical_first = vertical_first; 7044 7045 info->input_pixel_layout_internal = input_pixel_layout; 7046 info->output_pixel_layout_internal = output_pixel_layout; 7047 7048 // setup alpha weight functions 7049 info->alpha_weight = 0; 7050 info->alpha_unweight = 0; 7051 7052 // handle alpha weighting functions and overrides 7053 if ( alpha_weighting_type == 2 ) 7054 { 7055 // high quality alpha multiplying on the way in, dividing on the way out 7056 info->alpha_weight = fancy_alpha_weights[ input_pixel_layout - STBIRI_RGBA ]; 7057 info->alpha_unweight = fancy_alpha_unweights[ output_pixel_layout - STBIRI_RGBA ]; 7058 } 7059 else if ( alpha_weighting_type == 4 ) 7060 { 7061 // fast alpha multiplying on the way in, dividing on the way out 7062 info->alpha_weight = simple_alpha_weights[ input_pixel_layout - STBIRI_RGBA ]; 7063 info->alpha_unweight = simple_alpha_unweights[ output_pixel_layout - STBIRI_RGBA ]; 7064 } 7065 else if ( alpha_weighting_type == 1 ) 7066 { 7067 // fast alpha on the way in, leave in premultiplied form on way out 7068 info->alpha_weight = simple_alpha_weights[ input_pixel_layout - STBIRI_RGBA ]; 7069 } 7070 else if ( alpha_weighting_type == 3 ) 7071 { 7072 // incoming is premultiplied, fast alpha dividing on the way out - non-premultiplied output 7073 info->alpha_unweight = simple_alpha_unweights[ output_pixel_layout - STBIRI_RGBA ]; 7074 } 7075 7076 // handle 3-chan color flipping, using the alpha weight path 7077 if ( ( ( input_pixel_layout == STBIRI_RGB ) && ( output_pixel_layout == STBIRI_BGR ) ) || 7078 ( ( input_pixel_layout == STBIRI_BGR ) && ( output_pixel_layout == STBIRI_RGB ) ) ) 7079 { 7080 // do the flipping on the smaller of the two ends 7081 if ( horizontal->scale_info.scale < 1.0f ) 7082 info->alpha_unweight = stbir__simple_flip_3ch; 7083 else 7084 info->alpha_weight = stbir__simple_flip_3ch; 7085 } 7086 7087 } 7088 7089 // get all the per-split buffers 7090 for( i = 0 ; i < splits ; i++ ) 7091 { 7092 STBIR__NEXT_PTR( info->split_info[i].decode_buffer, decode_buffer_size, float ); 7093 7094 #ifdef STBIR__SEPARATE_ALLOCATIONS 7095 7096 #ifdef STBIR_SIMD8 7097 if ( ( info ) && ( effective_channels == 3 ) ) 7098 ++info->split_info[i].decode_buffer; // avx in 3 channel mode needs one float at the start of the buffer 7099 #endif 7100 7101 STBIR__NEXT_PTR( info->split_info[i].ring_buffers, alloc_ring_buffer_num_entries * sizeof(float*), float* ); 7102 { 7103 int j; 7104 for( j = 0 ; j < alloc_ring_buffer_num_entries ; j++ ) 7105 { 7106 STBIR__NEXT_PTR( info->split_info[i].ring_buffers[j], ring_buffer_length_bytes, float ); 7107 #ifdef STBIR_SIMD8 7108 if ( ( info ) && ( effective_channels == 3 ) ) 7109 ++info->split_info[i].ring_buffers[j]; // avx in 3 channel mode needs one float at the start of the buffer 7110 #endif 7111 } 7112 } 7113 #else 7114 STBIR__NEXT_PTR( info->split_info[i].ring_buffer, ring_buffer_size, float ); 7115 #endif 7116 STBIR__NEXT_PTR( info->split_info[i].vertical_buffer, vertical_buffer_size, float ); 7117 } 7118 7119 // alloc memory for to-be-pivoted coeffs (if necessary) 7120 if ( vertical->is_gather == 0 ) 7121 { 7122 int both; 7123 int temp_mem_amt; 7124 7125 // when in vertical scatter mode, we first build the coefficients in gather mode, and then pivot after, 7126 // that means we need two buffers, so we try to use the decode buffer and ring buffer for this. if that 7127 // is too small, we just allocate extra memory to use as this temp. 7128 7129 both = vertical->gather_prescatter_contributors_size + vertical->gather_prescatter_coefficients_size; 7130 7131 #ifdef STBIR__SEPARATE_ALLOCATIONS 7132 temp_mem_amt = decode_buffer_size; 7133 7134 #ifdef STBIR_SIMD8 7135 if ( effective_channels == 3 ) 7136 --temp_mem_amt; // avx in 3 channel mode needs one float at the start of the buffer 7137 #endif 7138 #else 7139 temp_mem_amt = ( decode_buffer_size + ring_buffer_size + vertical_buffer_size ) * splits; 7140 #endif 7141 if ( temp_mem_amt >= both ) 7142 { 7143 if ( info ) 7144 { 7145 vertical->gather_prescatter_contributors = (stbir__contributors*)info->split_info[0].decode_buffer; 7146 vertical->gather_prescatter_coefficients = (float*) ( ( (char*)info->split_info[0].decode_buffer ) + vertical->gather_prescatter_contributors_size ); 7147 } 7148 } 7149 else 7150 { 7151 // ring+decode memory is too small, so allocate temp memory 7152 STBIR__NEXT_PTR( vertical->gather_prescatter_contributors, vertical->gather_prescatter_contributors_size, stbir__contributors ); 7153 STBIR__NEXT_PTR( vertical->gather_prescatter_coefficients, vertical->gather_prescatter_coefficients_size, float ); 7154 } 7155 } 7156 7157 STBIR__NEXT_PTR( horizontal->contributors, horizontal->contributors_size, stbir__contributors ); 7158 STBIR__NEXT_PTR( horizontal->coefficients, horizontal->coefficients_size, float ); 7159 7160 // are the two filters identical?? (happens a lot with mipmap generation) 7161 if ( ( horizontal->filter_kernel == vertical->filter_kernel ) && ( horizontal->filter_support == vertical->filter_support ) && ( horizontal->edge == vertical->edge ) && ( horizontal->scale_info.output_sub_size == vertical->scale_info.output_sub_size ) ) 7162 { 7163 float diff_scale = horizontal->scale_info.scale - vertical->scale_info.scale; 7164 float diff_shift = horizontal->scale_info.pixel_shift - vertical->scale_info.pixel_shift; 7165 if ( diff_scale < 0.0f ) diff_scale = -diff_scale; 7166 if ( diff_shift < 0.0f ) diff_shift = -diff_shift; 7167 if ( ( diff_scale <= stbir__small_float ) && ( diff_shift <= stbir__small_float ) ) 7168 { 7169 if ( horizontal->is_gather == vertical->is_gather ) 7170 { 7171 copy_horizontal = 1; 7172 goto no_vert_alloc; 7173 } 7174 // everything matches, but vertical is scatter, horizontal is gather, use horizontal coeffs for vertical pivot coeffs 7175 possibly_use_horizontal_for_pivot = horizontal; 7176 } 7177 } 7178 7179 STBIR__NEXT_PTR( vertical->contributors, vertical->contributors_size, stbir__contributors ); 7180 STBIR__NEXT_PTR( vertical->coefficients, vertical->coefficients_size, float ); 7181 7182 no_vert_alloc: 7183 7184 if ( info ) 7185 { 7186 STBIR_PROFILE_BUILD_START( horizontal ); 7187 7188 stbir__calculate_filters( horizontal, 0, user_data STBIR_ONLY_PROFILE_BUILD_SET_INFO ); 7189 7190 // setup the horizontal gather functions 7191 // start with defaulting to the n_coeffs functions (specialized on channels and remnant leftover) 7192 info->horizontal_gather_channels = stbir__horizontal_gather_n_coeffs_funcs[ effective_channels ][ horizontal->extent_info.widest & 3 ]; 7193 // but if the number of coeffs <= 12, use another set of special cases. <=12 coeffs is any enlarging resize, or shrinking resize down to about 1/3 size 7194 if ( horizontal->extent_info.widest <= 12 ) 7195 info->horizontal_gather_channels = stbir__horizontal_gather_channels_funcs[ effective_channels ][ horizontal->extent_info.widest - 1 ]; 7196 7197 info->scanline_extents.conservative.n0 = conservative->n0; 7198 info->scanline_extents.conservative.n1 = conservative->n1; 7199 7200 // get exact extents 7201 stbir__get_extents( horizontal, &info->scanline_extents ); 7202 7203 // pack the horizontal coeffs 7204 horizontal->coefficient_width = stbir__pack_coefficients(horizontal->num_contributors, horizontal->contributors, horizontal->coefficients, horizontal->coefficient_width, horizontal->extent_info.widest, info->scanline_extents.conservative.n0, info->scanline_extents.conservative.n1 ); 7205 7206 STBIR_MEMCPY( &info->horizontal, horizontal, sizeof( stbir__sampler ) ); 7207 7208 STBIR_PROFILE_BUILD_END( horizontal ); 7209 7210 if ( copy_horizontal ) 7211 { 7212 STBIR_MEMCPY( &info->vertical, horizontal, sizeof( stbir__sampler ) ); 7213 } 7214 else 7215 { 7216 STBIR_PROFILE_BUILD_START( vertical ); 7217 7218 stbir__calculate_filters( vertical, possibly_use_horizontal_for_pivot, user_data STBIR_ONLY_PROFILE_BUILD_SET_INFO ); 7219 STBIR_MEMCPY( &info->vertical, vertical, sizeof( stbir__sampler ) ); 7220 7221 STBIR_PROFILE_BUILD_END( vertical ); 7222 } 7223 7224 // setup the vertical split ranges 7225 stbir__get_split_info( info->split_info, info->splits, info->vertical.scale_info.output_sub_size, info->vertical.filter_pixel_margin, info->vertical.scale_info.input_full_size ); 7226 7227 // now we know precisely how many entries we need 7228 info->ring_buffer_num_entries = info->vertical.extent_info.widest; 7229 7230 // we never need more ring buffer entries than the scanlines we're outputting 7231 if ( ( !info->vertical.is_gather ) && ( info->ring_buffer_num_entries > conservative_split_output_size ) ) 7232 info->ring_buffer_num_entries = conservative_split_output_size; 7233 STBIR_ASSERT( info->ring_buffer_num_entries <= info->alloc_ring_buffer_num_entries ); 7234 7235 // a few of the horizontal gather functions read past the end of the decode (but mask it out), 7236 // so put in normal values so no snans or denormals accidentally sneak in (also, in the ring 7237 // buffer for vertical first) 7238 for( i = 0 ; i < splits ; i++ ) 7239 { 7240 int t, ofs, start; 7241 7242 ofs = decode_buffer_size / 4; 7243 7244 #if defined( STBIR__SEPARATE_ALLOCATIONS ) && defined(STBIR_SIMD8) 7245 if ( effective_channels == 3 ) 7246 --ofs; // avx in 3 channel mode needs one float at the start of the buffer, so we snap back for clearing 7247 #endif 7248 7249 start = ofs - 4; 7250 if ( start < 0 ) start = 0; 7251 7252 for( t = start ; t < ofs; t++ ) 7253 info->split_info[i].decode_buffer[ t ] = 9999.0f; 7254 7255 if ( vertical_first ) 7256 { 7257 int j; 7258 for( j = 0; j < info->ring_buffer_num_entries ; j++ ) 7259 { 7260 for( t = start ; t < ofs; t++ ) 7261 stbir__get_ring_buffer_entry( info, info->split_info + i, j )[ t ] = 9999.0f; 7262 } 7263 } 7264 } 7265 } 7266 7267 #undef STBIR__NEXT_PTR 7268 7269 7270 // is this the first time through loop? 7271 if ( info == 0 ) 7272 { 7273 alloced_total = ( 15 + (size_t)advance_mem ); 7274 alloced = STBIR_MALLOC( alloced_total, user_data ); 7275 if ( alloced == 0 ) 7276 return 0; 7277 } 7278 else 7279 return info; // success 7280 } 7281 } 7282 7283 static int stbir__perform_resize( stbir__info const * info, int split_start, int split_count ) 7284 { 7285 stbir__per_split_info * split_info = info->split_info + split_start; 7286 7287 STBIR_PROFILE_CLEAR_EXTRAS(); 7288 7289 STBIR_PROFILE_FIRST_START( looping ); 7290 if (info->vertical.is_gather) 7291 stbir__vertical_gather_loop( info, split_info, split_count ); 7292 else 7293 stbir__vertical_scatter_loop( info, split_info, split_count ); 7294 STBIR_PROFILE_END( looping ); 7295 7296 return 1; 7297 } 7298 7299 static void stbir__update_info_from_resize( stbir__info * info, STBIR_RESIZE * resize ) 7300 { 7301 static stbir__decode_pixels_func * decode_simple[STBIR_TYPE_HALF_FLOAT-STBIR_TYPE_UINT8_SRGB+1]= 7302 { 7303 /* 1ch-4ch */ stbir__decode_uint8_srgb, stbir__decode_uint8_srgb, 0, stbir__decode_float_linear, stbir__decode_half_float_linear, 7304 }; 7305 7306 static stbir__decode_pixels_func * decode_alphas[STBIRI_AR-STBIRI_RGBA+1][STBIR_TYPE_HALF_FLOAT-STBIR_TYPE_UINT8_SRGB+1]= 7307 { 7308 { /* RGBA */ stbir__decode_uint8_srgb4_linearalpha, stbir__decode_uint8_srgb, 0, stbir__decode_float_linear, stbir__decode_half_float_linear }, 7309 { /* BGRA */ stbir__decode_uint8_srgb4_linearalpha_BGRA, stbir__decode_uint8_srgb_BGRA, 0, stbir__decode_float_linear_BGRA, stbir__decode_half_float_linear_BGRA }, 7310 { /* ARGB */ stbir__decode_uint8_srgb4_linearalpha_ARGB, stbir__decode_uint8_srgb_ARGB, 0, stbir__decode_float_linear_ARGB, stbir__decode_half_float_linear_ARGB }, 7311 { /* ABGR */ stbir__decode_uint8_srgb4_linearalpha_ABGR, stbir__decode_uint8_srgb_ABGR, 0, stbir__decode_float_linear_ABGR, stbir__decode_half_float_linear_ABGR }, 7312 { /* RA */ stbir__decode_uint8_srgb2_linearalpha, stbir__decode_uint8_srgb, 0, stbir__decode_float_linear, stbir__decode_half_float_linear }, 7313 { /* AR */ stbir__decode_uint8_srgb2_linearalpha_AR, stbir__decode_uint8_srgb_AR, 0, stbir__decode_float_linear_AR, stbir__decode_half_float_linear_AR }, 7314 }; 7315 7316 static stbir__decode_pixels_func * decode_simple_scaled_or_not[2][2]= 7317 { 7318 { stbir__decode_uint8_linear_scaled, stbir__decode_uint8_linear }, { stbir__decode_uint16_linear_scaled, stbir__decode_uint16_linear }, 7319 }; 7320 7321 static stbir__decode_pixels_func * decode_alphas_scaled_or_not[STBIRI_AR-STBIRI_RGBA+1][2][2]= 7322 { 7323 { /* RGBA */ { stbir__decode_uint8_linear_scaled, stbir__decode_uint8_linear }, { stbir__decode_uint16_linear_scaled, stbir__decode_uint16_linear } }, 7324 { /* BGRA */ { stbir__decode_uint8_linear_scaled_BGRA, stbir__decode_uint8_linear_BGRA }, { stbir__decode_uint16_linear_scaled_BGRA, stbir__decode_uint16_linear_BGRA } }, 7325 { /* ARGB */ { stbir__decode_uint8_linear_scaled_ARGB, stbir__decode_uint8_linear_ARGB }, { stbir__decode_uint16_linear_scaled_ARGB, stbir__decode_uint16_linear_ARGB } }, 7326 { /* ABGR */ { stbir__decode_uint8_linear_scaled_ABGR, stbir__decode_uint8_linear_ABGR }, { stbir__decode_uint16_linear_scaled_ABGR, stbir__decode_uint16_linear_ABGR } }, 7327 { /* RA */ { stbir__decode_uint8_linear_scaled, stbir__decode_uint8_linear }, { stbir__decode_uint16_linear_scaled, stbir__decode_uint16_linear } }, 7328 { /* AR */ { stbir__decode_uint8_linear_scaled_AR, stbir__decode_uint8_linear_AR }, { stbir__decode_uint16_linear_scaled_AR, stbir__decode_uint16_linear_AR } } 7329 }; 7330 7331 static stbir__encode_pixels_func * encode_simple[STBIR_TYPE_HALF_FLOAT-STBIR_TYPE_UINT8_SRGB+1]= 7332 { 7333 /* 1ch-4ch */ stbir__encode_uint8_srgb, stbir__encode_uint8_srgb, 0, stbir__encode_float_linear, stbir__encode_half_float_linear, 7334 }; 7335 7336 static stbir__encode_pixels_func * encode_alphas[STBIRI_AR-STBIRI_RGBA+1][STBIR_TYPE_HALF_FLOAT-STBIR_TYPE_UINT8_SRGB+1]= 7337 { 7338 { /* RGBA */ stbir__encode_uint8_srgb4_linearalpha, stbir__encode_uint8_srgb, 0, stbir__encode_float_linear, stbir__encode_half_float_linear }, 7339 { /* BGRA */ stbir__encode_uint8_srgb4_linearalpha_BGRA, stbir__encode_uint8_srgb_BGRA, 0, stbir__encode_float_linear_BGRA, stbir__encode_half_float_linear_BGRA }, 7340 { /* ARGB */ stbir__encode_uint8_srgb4_linearalpha_ARGB, stbir__encode_uint8_srgb_ARGB, 0, stbir__encode_float_linear_ARGB, stbir__encode_half_float_linear_ARGB }, 7341 { /* ABGR */ stbir__encode_uint8_srgb4_linearalpha_ABGR, stbir__encode_uint8_srgb_ABGR, 0, stbir__encode_float_linear_ABGR, stbir__encode_half_float_linear_ABGR }, 7342 { /* RA */ stbir__encode_uint8_srgb2_linearalpha, stbir__encode_uint8_srgb, 0, stbir__encode_float_linear, stbir__encode_half_float_linear }, 7343 { /* AR */ stbir__encode_uint8_srgb2_linearalpha_AR, stbir__encode_uint8_srgb_AR, 0, stbir__encode_float_linear_AR, stbir__encode_half_float_linear_AR } 7344 }; 7345 7346 static stbir__encode_pixels_func * encode_simple_scaled_or_not[2][2]= 7347 { 7348 { stbir__encode_uint8_linear_scaled, stbir__encode_uint8_linear }, { stbir__encode_uint16_linear_scaled, stbir__encode_uint16_linear }, 7349 }; 7350 7351 static stbir__encode_pixels_func * encode_alphas_scaled_or_not[STBIRI_AR-STBIRI_RGBA+1][2][2]= 7352 { 7353 { /* RGBA */ { stbir__encode_uint8_linear_scaled, stbir__encode_uint8_linear }, { stbir__encode_uint16_linear_scaled, stbir__encode_uint16_linear } }, 7354 { /* BGRA */ { stbir__encode_uint8_linear_scaled_BGRA, stbir__encode_uint8_linear_BGRA }, { stbir__encode_uint16_linear_scaled_BGRA, stbir__encode_uint16_linear_BGRA } }, 7355 { /* ARGB */ { stbir__encode_uint8_linear_scaled_ARGB, stbir__encode_uint8_linear_ARGB }, { stbir__encode_uint16_linear_scaled_ARGB, stbir__encode_uint16_linear_ARGB } }, 7356 { /* ABGR */ { stbir__encode_uint8_linear_scaled_ABGR, stbir__encode_uint8_linear_ABGR }, { stbir__encode_uint16_linear_scaled_ABGR, stbir__encode_uint16_linear_ABGR } }, 7357 { /* RA */ { stbir__encode_uint8_linear_scaled, stbir__encode_uint8_linear }, { stbir__encode_uint16_linear_scaled, stbir__encode_uint16_linear } }, 7358 { /* AR */ { stbir__encode_uint8_linear_scaled_AR, stbir__encode_uint8_linear_AR }, { stbir__encode_uint16_linear_scaled_AR, stbir__encode_uint16_linear_AR } } 7359 }; 7360 7361 stbir__decode_pixels_func * decode_pixels = 0; 7362 stbir__encode_pixels_func * encode_pixels = 0; 7363 stbir_datatype input_type, output_type; 7364 7365 input_type = resize->input_data_type; 7366 output_type = resize->output_data_type; 7367 info->input_data = resize->input_pixels; 7368 info->input_stride_bytes = resize->input_stride_in_bytes; 7369 info->output_stride_bytes = resize->output_stride_in_bytes; 7370 7371 // if we're completely point sampling, then we can turn off SRGB 7372 if ( ( info->horizontal.filter_enum == STBIR_FILTER_POINT_SAMPLE ) && ( info->vertical.filter_enum == STBIR_FILTER_POINT_SAMPLE ) ) 7373 { 7374 if ( ( ( input_type == STBIR_TYPE_UINT8_SRGB ) || ( input_type == STBIR_TYPE_UINT8_SRGB_ALPHA ) ) && 7375 ( ( output_type == STBIR_TYPE_UINT8_SRGB ) || ( output_type == STBIR_TYPE_UINT8_SRGB_ALPHA ) ) ) 7376 { 7377 input_type = STBIR_TYPE_UINT8; 7378 output_type = STBIR_TYPE_UINT8; 7379 } 7380 } 7381 7382 // recalc the output and input strides 7383 if ( info->input_stride_bytes == 0 ) 7384 info->input_stride_bytes = info->channels * info->horizontal.scale_info.input_full_size * stbir__type_size[input_type]; 7385 7386 if ( info->output_stride_bytes == 0 ) 7387 info->output_stride_bytes = info->channels * info->horizontal.scale_info.output_sub_size * stbir__type_size[output_type]; 7388 7389 // calc offset 7390 info->output_data = ( (char*) resize->output_pixels ) + ( (size_t) info->offset_y * (size_t) resize->output_stride_in_bytes ) + ( info->offset_x * info->channels * stbir__type_size[output_type] ); 7391 7392 info->in_pixels_cb = resize->input_cb; 7393 info->user_data = resize->user_data; 7394 info->out_pixels_cb = resize->output_cb; 7395 7396 // setup the input format converters 7397 if ( ( input_type == STBIR_TYPE_UINT8 ) || ( input_type == STBIR_TYPE_UINT16 ) ) 7398 { 7399 int non_scaled = 0; 7400 7401 // check if we can run unscaled - 0-255.0/0-65535.0 instead of 0-1.0 (which is a tiny bit faster when doing linear 8->8 or 16->16) 7402 if ( ( !info->alpha_weight ) && ( !info->alpha_unweight ) ) // don't short circuit when alpha weighting (get everything to 0-1.0 as usual) 7403 if ( ( ( input_type == STBIR_TYPE_UINT8 ) && ( output_type == STBIR_TYPE_UINT8 ) ) || ( ( input_type == STBIR_TYPE_UINT16 ) && ( output_type == STBIR_TYPE_UINT16 ) ) ) 7404 non_scaled = 1; 7405 7406 if ( info->input_pixel_layout_internal <= STBIRI_4CHANNEL ) 7407 decode_pixels = decode_simple_scaled_or_not[ input_type == STBIR_TYPE_UINT16 ][ non_scaled ]; 7408 else 7409 decode_pixels = decode_alphas_scaled_or_not[ ( info->input_pixel_layout_internal - STBIRI_RGBA ) % ( STBIRI_AR-STBIRI_RGBA+1 ) ][ input_type == STBIR_TYPE_UINT16 ][ non_scaled ]; 7410 } 7411 else 7412 { 7413 if ( info->input_pixel_layout_internal <= STBIRI_4CHANNEL ) 7414 decode_pixels = decode_simple[ input_type - STBIR_TYPE_UINT8_SRGB ]; 7415 else 7416 decode_pixels = decode_alphas[ ( info->input_pixel_layout_internal - STBIRI_RGBA ) % ( STBIRI_AR-STBIRI_RGBA+1 ) ][ input_type - STBIR_TYPE_UINT8_SRGB ]; 7417 } 7418 7419 // setup the output format converters 7420 if ( ( output_type == STBIR_TYPE_UINT8 ) || ( output_type == STBIR_TYPE_UINT16 ) ) 7421 { 7422 int non_scaled = 0; 7423 7424 // check if we can run unscaled - 0-255.0/0-65535.0 instead of 0-1.0 (which is a tiny bit faster when doing linear 8->8 or 16->16) 7425 if ( ( !info->alpha_weight ) && ( !info->alpha_unweight ) ) // don't short circuit when alpha weighting (get everything to 0-1.0 as usual) 7426 if ( ( ( input_type == STBIR_TYPE_UINT8 ) && ( output_type == STBIR_TYPE_UINT8 ) ) || ( ( input_type == STBIR_TYPE_UINT16 ) && ( output_type == STBIR_TYPE_UINT16 ) ) ) 7427 non_scaled = 1; 7428 7429 if ( info->output_pixel_layout_internal <= STBIRI_4CHANNEL ) 7430 encode_pixels = encode_simple_scaled_or_not[ output_type == STBIR_TYPE_UINT16 ][ non_scaled ]; 7431 else 7432 encode_pixels = encode_alphas_scaled_or_not[ ( info->output_pixel_layout_internal - STBIRI_RGBA ) % ( STBIRI_AR-STBIRI_RGBA+1 ) ][ output_type == STBIR_TYPE_UINT16 ][ non_scaled ]; 7433 } 7434 else 7435 { 7436 if ( info->output_pixel_layout_internal <= STBIRI_4CHANNEL ) 7437 encode_pixels = encode_simple[ output_type - STBIR_TYPE_UINT8_SRGB ]; 7438 else 7439 encode_pixels = encode_alphas[ ( info->output_pixel_layout_internal - STBIRI_RGBA ) % ( STBIRI_AR-STBIRI_RGBA+1 ) ][ output_type - STBIR_TYPE_UINT8_SRGB ]; 7440 } 7441 7442 info->input_type = input_type; 7443 info->output_type = output_type; 7444 info->decode_pixels = decode_pixels; 7445 info->encode_pixels = encode_pixels; 7446 } 7447 7448 static void stbir__clip( int * outx, int * outsubw, int outw, double * u0, double * u1 ) 7449 { 7450 double per, adj; 7451 int over; 7452 7453 // do left/top edge 7454 if ( *outx < 0 ) 7455 { 7456 per = ( (double)*outx ) / ( (double)*outsubw ); // is negative 7457 adj = per * ( *u1 - *u0 ); 7458 *u0 -= adj; // increases u0 7459 *outx = 0; 7460 } 7461 7462 // do right/bot edge 7463 over = outw - ( *outx + *outsubw ); 7464 if ( over < 0 ) 7465 { 7466 per = ( (double)over ) / ( (double)*outsubw ); // is negative 7467 adj = per * ( *u1 - *u0 ); 7468 *u1 += adj; // decrease u1 7469 *outsubw = outw - *outx; 7470 } 7471 } 7472 7473 // converts a double to a rational that has less than one float bit of error (returns 0 if unable to do so) 7474 static int stbir__double_to_rational(double f, stbir_uint32 limit, stbir_uint32 *numer, stbir_uint32 *denom, int limit_denom ) // limit_denom (1) or limit numer (0) 7475 { 7476 double err; 7477 stbir_uint64 top, bot; 7478 stbir_uint64 numer_last = 0; 7479 stbir_uint64 denom_last = 1; 7480 stbir_uint64 numer_estimate = 1; 7481 stbir_uint64 denom_estimate = 0; 7482 7483 // scale to past float error range 7484 top = (stbir_uint64)( f * (double)(1 << 25) ); 7485 bot = 1 << 25; 7486 7487 // keep refining, but usually stops in a few loops - usually 5 for bad cases 7488 for(;;) 7489 { 7490 stbir_uint64 est, temp; 7491 7492 // hit limit, break out and do best full range estimate 7493 if ( ( ( limit_denom ) ? denom_estimate : numer_estimate ) >= limit ) 7494 break; 7495 7496 // is the current error less than 1 bit of a float? if so, we're done 7497 if ( denom_estimate ) 7498 { 7499 err = ( (double)numer_estimate / (double)denom_estimate ) - f; 7500 if ( err < 0.0 ) err = -err; 7501 if ( err < ( 1.0 / (double)(1<<24) ) ) 7502 { 7503 // yup, found it 7504 *numer = (stbir_uint32) numer_estimate; 7505 *denom = (stbir_uint32) denom_estimate; 7506 return 1; 7507 } 7508 } 7509 7510 // no more refinement bits left? break out and do full range estimate 7511 if ( bot == 0 ) 7512 break; 7513 7514 // gcd the estimate bits 7515 est = top / bot; 7516 temp = top % bot; 7517 top = bot; 7518 bot = temp; 7519 7520 // move remainders 7521 temp = est * denom_estimate + denom_last; 7522 denom_last = denom_estimate; 7523 denom_estimate = temp; 7524 7525 // move remainders 7526 temp = est * numer_estimate + numer_last; 7527 numer_last = numer_estimate; 7528 numer_estimate = temp; 7529 } 7530 7531 // we didn't fine anything good enough for float, use a full range estimate 7532 if ( limit_denom ) 7533 { 7534 numer_estimate= (stbir_uint64)( f * (double)limit + 0.5 ); 7535 denom_estimate = limit; 7536 } 7537 else 7538 { 7539 numer_estimate = limit; 7540 denom_estimate = (stbir_uint64)( ( (double)limit / f ) + 0.5 ); 7541 } 7542 7543 *numer = (stbir_uint32) numer_estimate; 7544 *denom = (stbir_uint32) denom_estimate; 7545 7546 err = ( denom_estimate ) ? ( ( (double)(stbir_uint32)numer_estimate / (double)(stbir_uint32)denom_estimate ) - f ) : 1.0; 7547 if ( err < 0.0 ) err = -err; 7548 return ( err < ( 1.0 / (double)(1<<24) ) ) ? 1 : 0; 7549 } 7550 7551 static int stbir__calculate_region_transform( stbir__scale_info * scale_info, int output_full_range, int * output_offset, int output_sub_range, int input_full_range, double input_s0, double input_s1 ) 7552 { 7553 double output_range, input_range, output_s, input_s, ratio, scale; 7554 7555 input_s = input_s1 - input_s0; 7556 7557 // null area 7558 if ( ( output_full_range == 0 ) || ( input_full_range == 0 ) || 7559 ( output_sub_range == 0 ) || ( input_s <= stbir__small_float ) ) 7560 return 0; 7561 7562 // are either of the ranges completely out of bounds? 7563 if ( ( *output_offset >= output_full_range ) || ( ( *output_offset + output_sub_range ) <= 0 ) || ( input_s0 >= (1.0f-stbir__small_float) ) || ( input_s1 <= stbir__small_float ) ) 7564 return 0; 7565 7566 output_range = (double)output_full_range; 7567 input_range = (double)input_full_range; 7568 7569 output_s = ( (double)output_sub_range) / output_range; 7570 7571 // figure out the scaling to use 7572 ratio = output_s / input_s; 7573 7574 // save scale before clipping 7575 scale = ( output_range / input_range ) * ratio; 7576 scale_info->scale = (float)scale; 7577 scale_info->inv_scale = (float)( 1.0 / scale ); 7578 7579 // clip output area to left/right output edges (and adjust input area) 7580 stbir__clip( output_offset, &output_sub_range, output_full_range, &input_s0, &input_s1 ); 7581 7582 // recalc input area 7583 input_s = input_s1 - input_s0; 7584 7585 // after clipping do we have zero input area? 7586 if ( input_s <= stbir__small_float ) 7587 return 0; 7588 7589 // calculate and store the starting source offsets in output pixel space 7590 scale_info->pixel_shift = (float) ( input_s0 * ratio * output_range ); 7591 7592 scale_info->scale_is_rational = stbir__double_to_rational( scale, ( scale <= 1.0 ) ? output_full_range : input_full_range, &scale_info->scale_numerator, &scale_info->scale_denominator, ( scale >= 1.0 ) ); 7593 7594 scale_info->input_full_size = input_full_range; 7595 scale_info->output_sub_size = output_sub_range; 7596 7597 return 1; 7598 } 7599 7600 7601 static void stbir__init_and_set_layout( STBIR_RESIZE * resize, stbir_pixel_layout pixel_layout, stbir_datatype data_type ) 7602 { 7603 resize->input_cb = 0; 7604 resize->output_cb = 0; 7605 resize->user_data = resize; 7606 resize->samplers = 0; 7607 resize->called_alloc = 0; 7608 resize->horizontal_filter = STBIR_FILTER_DEFAULT; 7609 resize->horizontal_filter_kernel = 0; resize->horizontal_filter_support = 0; 7610 resize->vertical_filter = STBIR_FILTER_DEFAULT; 7611 resize->vertical_filter_kernel = 0; resize->vertical_filter_support = 0; 7612 resize->horizontal_edge = STBIR_EDGE_CLAMP; 7613 resize->vertical_edge = STBIR_EDGE_CLAMP; 7614 resize->input_s0 = 0; resize->input_t0 = 0; resize->input_s1 = 1; resize->input_t1 = 1; 7615 resize->output_subx = 0; resize->output_suby = 0; resize->output_subw = resize->output_w; resize->output_subh = resize->output_h; 7616 resize->input_data_type = data_type; 7617 resize->output_data_type = data_type; 7618 resize->input_pixel_layout_public = pixel_layout; 7619 resize->output_pixel_layout_public = pixel_layout; 7620 resize->needs_rebuild = 1; 7621 } 7622 7623 STBIRDEF void stbir_resize_init( STBIR_RESIZE * resize, 7624 const void *input_pixels, int input_w, int input_h, int input_stride_in_bytes, // stride can be zero 7625 void *output_pixels, int output_w, int output_h, int output_stride_in_bytes, // stride can be zero 7626 stbir_pixel_layout pixel_layout, stbir_datatype data_type ) 7627 { 7628 resize->input_pixels = input_pixels; 7629 resize->input_w = input_w; 7630 resize->input_h = input_h; 7631 resize->input_stride_in_bytes = input_stride_in_bytes; 7632 resize->output_pixels = output_pixels; 7633 resize->output_w = output_w; 7634 resize->output_h = output_h; 7635 resize->output_stride_in_bytes = output_stride_in_bytes; 7636 resize->fast_alpha = 0; 7637 7638 stbir__init_and_set_layout( resize, pixel_layout, data_type ); 7639 } 7640 7641 // You can update parameters any time after resize_init 7642 STBIRDEF void stbir_set_datatypes( STBIR_RESIZE * resize, stbir_datatype input_type, stbir_datatype output_type ) // by default, datatype from resize_init 7643 { 7644 resize->input_data_type = input_type; 7645 resize->output_data_type = output_type; 7646 if ( ( resize->samplers ) && ( !resize->needs_rebuild ) ) 7647 stbir__update_info_from_resize( resize->samplers, resize ); 7648 } 7649 7650 STBIRDEF void stbir_set_pixel_callbacks( STBIR_RESIZE * resize, stbir_input_callback * input_cb, stbir_output_callback * output_cb ) // no callbacks by default 7651 { 7652 resize->input_cb = input_cb; 7653 resize->output_cb = output_cb; 7654 7655 if ( ( resize->samplers ) && ( !resize->needs_rebuild ) ) 7656 { 7657 resize->samplers->in_pixels_cb = input_cb; 7658 resize->samplers->out_pixels_cb = output_cb; 7659 } 7660 } 7661 7662 STBIRDEF void stbir_set_user_data( STBIR_RESIZE * resize, void * user_data ) // pass back STBIR_RESIZE* by default 7663 { 7664 resize->user_data = user_data; 7665 if ( ( resize->samplers ) && ( !resize->needs_rebuild ) ) 7666 resize->samplers->user_data = user_data; 7667 } 7668 7669 STBIRDEF void stbir_set_buffer_ptrs( STBIR_RESIZE * resize, const void * input_pixels, int input_stride_in_bytes, void * output_pixels, int output_stride_in_bytes ) 7670 { 7671 resize->input_pixels = input_pixels; 7672 resize->input_stride_in_bytes = input_stride_in_bytes; 7673 resize->output_pixels = output_pixels; 7674 resize->output_stride_in_bytes = output_stride_in_bytes; 7675 if ( ( resize->samplers ) && ( !resize->needs_rebuild ) ) 7676 stbir__update_info_from_resize( resize->samplers, resize ); 7677 } 7678 7679 7680 STBIRDEF int stbir_set_edgemodes( STBIR_RESIZE * resize, stbir_edge horizontal_edge, stbir_edge vertical_edge ) // CLAMP by default 7681 { 7682 resize->horizontal_edge = horizontal_edge; 7683 resize->vertical_edge = vertical_edge; 7684 resize->needs_rebuild = 1; 7685 return 1; 7686 } 7687 7688 STBIRDEF int stbir_set_filters( STBIR_RESIZE * resize, stbir_filter horizontal_filter, stbir_filter vertical_filter ) // STBIR_DEFAULT_FILTER_UPSAMPLE/DOWNSAMPLE by default 7689 { 7690 resize->horizontal_filter = horizontal_filter; 7691 resize->vertical_filter = vertical_filter; 7692 resize->needs_rebuild = 1; 7693 return 1; 7694 } 7695 7696 STBIRDEF int stbir_set_filter_callbacks( STBIR_RESIZE * resize, stbir__kernel_callback * horizontal_filter, stbir__support_callback * horizontal_support, stbir__kernel_callback * vertical_filter, stbir__support_callback * vertical_support ) 7697 { 7698 resize->horizontal_filter_kernel = horizontal_filter; resize->horizontal_filter_support = horizontal_support; 7699 resize->vertical_filter_kernel = vertical_filter; resize->vertical_filter_support = vertical_support; 7700 resize->needs_rebuild = 1; 7701 return 1; 7702 } 7703 7704 STBIRDEF int stbir_set_pixel_layouts( STBIR_RESIZE * resize, stbir_pixel_layout input_pixel_layout, stbir_pixel_layout output_pixel_layout ) // sets new pixel layouts 7705 { 7706 resize->input_pixel_layout_public = input_pixel_layout; 7707 resize->output_pixel_layout_public = output_pixel_layout; 7708 resize->needs_rebuild = 1; 7709 return 1; 7710 } 7711 7712 7713 STBIRDEF int stbir_set_non_pm_alpha_speed_over_quality( STBIR_RESIZE * resize, int non_pma_alpha_speed_over_quality ) // sets alpha speed 7714 { 7715 resize->fast_alpha = non_pma_alpha_speed_over_quality; 7716 resize->needs_rebuild = 1; 7717 return 1; 7718 } 7719 7720 STBIRDEF int stbir_set_input_subrect( STBIR_RESIZE * resize, double s0, double t0, double s1, double t1 ) // sets input region (full region by default) 7721 { 7722 resize->input_s0 = s0; 7723 resize->input_t0 = t0; 7724 resize->input_s1 = s1; 7725 resize->input_t1 = t1; 7726 resize->needs_rebuild = 1; 7727 7728 // are we inbounds? 7729 if ( ( s1 < stbir__small_float ) || ( (s1-s0) < stbir__small_float ) || 7730 ( t1 < stbir__small_float ) || ( (t1-t0) < stbir__small_float ) || 7731 ( s0 > (1.0f-stbir__small_float) ) || 7732 ( t0 > (1.0f-stbir__small_float) ) ) 7733 return 0; 7734 7735 return 1; 7736 } 7737 7738 STBIRDEF int stbir_set_output_pixel_subrect( STBIR_RESIZE * resize, int subx, int suby, int subw, int subh ) // sets input region (full region by default) 7739 { 7740 resize->output_subx = subx; 7741 resize->output_suby = suby; 7742 resize->output_subw = subw; 7743 resize->output_subh = subh; 7744 resize->needs_rebuild = 1; 7745 7746 // are we inbounds? 7747 if ( ( subx >= resize->output_w ) || ( ( subx + subw ) <= 0 ) || ( suby >= resize->output_h ) || ( ( suby + subh ) <= 0 ) || ( subw == 0 ) || ( subh == 0 ) ) 7748 return 0; 7749 7750 return 1; 7751 } 7752 7753 STBIRDEF int stbir_set_pixel_subrect( STBIR_RESIZE * resize, int subx, int suby, int subw, int subh ) // sets both regions (full regions by default) 7754 { 7755 double s0, t0, s1, t1; 7756 7757 s0 = ( (double)subx ) / ( (double)resize->output_w ); 7758 t0 = ( (double)suby ) / ( (double)resize->output_h ); 7759 s1 = ( (double)(subx+subw) ) / ( (double)resize->output_w ); 7760 t1 = ( (double)(suby+subh) ) / ( (double)resize->output_h ); 7761 7762 resize->input_s0 = s0; 7763 resize->input_t0 = t0; 7764 resize->input_s1 = s1; 7765 resize->input_t1 = t1; 7766 resize->output_subx = subx; 7767 resize->output_suby = suby; 7768 resize->output_subw = subw; 7769 resize->output_subh = subh; 7770 resize->needs_rebuild = 1; 7771 7772 // are we inbounds? 7773 if ( ( subx >= resize->output_w ) || ( ( subx + subw ) <= 0 ) || ( suby >= resize->output_h ) || ( ( suby + subh ) <= 0 ) || ( subw == 0 ) || ( subh == 0 ) ) 7774 return 0; 7775 7776 return 1; 7777 } 7778 7779 static int stbir__perform_build( STBIR_RESIZE * resize, int splits ) 7780 { 7781 stbir__contributors conservative = { 0, 0 }; 7782 stbir__sampler horizontal, vertical; 7783 int new_output_subx, new_output_suby; 7784 stbir__info * out_info; 7785 #ifdef STBIR_PROFILE 7786 stbir__info profile_infod; // used to contain building profile info before everything is allocated 7787 stbir__info * profile_info = &profile_infod; 7788 #endif 7789 7790 // have we already built the samplers? 7791 if ( resize->samplers ) 7792 return 0; 7793 7794 #define STBIR_RETURN_ERROR_AND_ASSERT( exp ) STBIR_ASSERT( !(exp) ); if (exp) return 0; 7795 STBIR_RETURN_ERROR_AND_ASSERT( (unsigned)resize->horizontal_filter >= STBIR_FILTER_OTHER) 7796 STBIR_RETURN_ERROR_AND_ASSERT( (unsigned)resize->vertical_filter >= STBIR_FILTER_OTHER) 7797 #undef STBIR_RETURN_ERROR_AND_ASSERT 7798 7799 if ( splits <= 0 ) 7800 return 0; 7801 7802 STBIR_PROFILE_BUILD_FIRST_START( build ); 7803 7804 new_output_subx = resize->output_subx; 7805 new_output_suby = resize->output_suby; 7806 7807 // do horizontal clip and scale calcs 7808 if ( !stbir__calculate_region_transform( &horizontal.scale_info, resize->output_w, &new_output_subx, resize->output_subw, resize->input_w, resize->input_s0, resize->input_s1 ) ) 7809 return 0; 7810 7811 // do vertical clip and scale calcs 7812 if ( !stbir__calculate_region_transform( &vertical.scale_info, resize->output_h, &new_output_suby, resize->output_subh, resize->input_h, resize->input_t0, resize->input_t1 ) ) 7813 return 0; 7814 7815 // if nothing to do, just return 7816 if ( ( horizontal.scale_info.output_sub_size == 0 ) || ( vertical.scale_info.output_sub_size == 0 ) ) 7817 return 0; 7818 7819 stbir__set_sampler(&horizontal, resize->horizontal_filter, resize->horizontal_filter_kernel, resize->horizontal_filter_support, resize->horizontal_edge, &horizontal.scale_info, 1, resize->user_data ); 7820 stbir__get_conservative_extents( &horizontal, &conservative, resize->user_data ); 7821 stbir__set_sampler(&vertical, resize->vertical_filter, resize->horizontal_filter_kernel, resize->vertical_filter_support, resize->vertical_edge, &vertical.scale_info, 0, resize->user_data ); 7822 7823 if ( ( vertical.scale_info.output_sub_size / splits ) < STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS ) // each split should be a minimum of 4 scanlines (handwavey choice) 7824 { 7825 splits = vertical.scale_info.output_sub_size / STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS; 7826 if ( splits == 0 ) splits = 1; 7827 } 7828 7829 STBIR_PROFILE_BUILD_START( alloc ); 7830 out_info = stbir__alloc_internal_mem_and_build_samplers( &horizontal, &vertical, &conservative, resize->input_pixel_layout_public, resize->output_pixel_layout_public, splits, new_output_subx, new_output_suby, resize->fast_alpha, resize->user_data STBIR_ONLY_PROFILE_BUILD_SET_INFO ); 7831 STBIR_PROFILE_BUILD_END( alloc ); 7832 STBIR_PROFILE_BUILD_END( build ); 7833 7834 if ( out_info ) 7835 { 7836 resize->splits = splits; 7837 resize->samplers = out_info; 7838 resize->needs_rebuild = 0; 7839 #ifdef STBIR_PROFILE 7840 STBIR_MEMCPY( &out_info->profile, &profile_infod.profile, sizeof( out_info->profile ) ); 7841 #endif 7842 7843 // update anything that can be changed without recalcing samplers 7844 stbir__update_info_from_resize( out_info, resize ); 7845 7846 return splits; 7847 } 7848 7849 return 0; 7850 } 7851 7852 void stbir_free_samplers( STBIR_RESIZE * resize ) 7853 { 7854 if ( resize->samplers ) 7855 { 7856 stbir__free_internal_mem( resize->samplers ); 7857 resize->samplers = 0; 7858 resize->called_alloc = 0; 7859 } 7860 } 7861 7862 STBIRDEF int stbir_build_samplers_with_splits( STBIR_RESIZE * resize, int splits ) 7863 { 7864 if ( ( resize->samplers == 0 ) || ( resize->needs_rebuild ) ) 7865 { 7866 if ( resize->samplers ) 7867 stbir_free_samplers( resize ); 7868 7869 resize->called_alloc = 1; 7870 return stbir__perform_build( resize, splits ); 7871 } 7872 7873 STBIR_PROFILE_BUILD_CLEAR( resize->samplers ); 7874 7875 return 1; 7876 } 7877 7878 STBIRDEF int stbir_build_samplers( STBIR_RESIZE * resize ) 7879 { 7880 return stbir_build_samplers_with_splits( resize, 1 ); 7881 } 7882 7883 STBIRDEF int stbir_resize_extended( STBIR_RESIZE * resize ) 7884 { 7885 int result; 7886 7887 if ( ( resize->samplers == 0 ) || ( resize->needs_rebuild ) ) 7888 { 7889 int alloc_state = resize->called_alloc; // remember allocated state 7890 7891 if ( resize->samplers ) 7892 { 7893 stbir__free_internal_mem( resize->samplers ); 7894 resize->samplers = 0; 7895 } 7896 7897 if ( !stbir_build_samplers( resize ) ) 7898 return 0; 7899 7900 resize->called_alloc = alloc_state; 7901 7902 // if build_samplers succeeded (above), but there are no samplers set, then 7903 // the area to stretch into was zero pixels, so don't do anything and return 7904 // success 7905 if ( resize->samplers == 0 ) 7906 return 1; 7907 } 7908 else 7909 { 7910 // didn't build anything - clear it 7911 STBIR_PROFILE_BUILD_CLEAR( resize->samplers ); 7912 } 7913 7914 // do resize 7915 result = stbir__perform_resize( resize->samplers, 0, resize->splits ); 7916 7917 // if we alloced, then free 7918 if ( !resize->called_alloc ) 7919 { 7920 stbir_free_samplers( resize ); 7921 resize->samplers = 0; 7922 } 7923 7924 return result; 7925 } 7926 7927 STBIRDEF int stbir_resize_extended_split( STBIR_RESIZE * resize, int split_start, int split_count ) 7928 { 7929 STBIR_ASSERT( resize->samplers ); 7930 7931 // if we're just doing the whole thing, call full 7932 if ( ( split_start == -1 ) || ( ( split_start == 0 ) && ( split_count == resize->splits ) ) ) 7933 return stbir_resize_extended( resize ); 7934 7935 // you **must** build samplers first when using split resize 7936 if ( ( resize->samplers == 0 ) || ( resize->needs_rebuild ) ) 7937 return 0; 7938 7939 if ( ( split_start >= resize->splits ) || ( split_start < 0 ) || ( ( split_start + split_count ) > resize->splits ) || ( split_count <= 0 ) ) 7940 return 0; 7941 7942 // do resize 7943 return stbir__perform_resize( resize->samplers, split_start, split_count ); 7944 } 7945 7946 static int stbir__check_output_stuff( void ** ret_ptr, int * ret_pitch, void * output_pixels, int type_size, int output_w, int output_h, int output_stride_in_bytes, stbir_internal_pixel_layout pixel_layout ) 7947 { 7948 size_t size; 7949 int pitch; 7950 void * ptr; 7951 7952 pitch = output_w * type_size * stbir__pixel_channels[ pixel_layout ]; 7953 if ( pitch == 0 ) 7954 return 0; 7955 7956 if ( output_stride_in_bytes == 0 ) 7957 output_stride_in_bytes = pitch; 7958 7959 if ( output_stride_in_bytes < pitch ) 7960 return 0; 7961 7962 size = (size_t)output_stride_in_bytes * (size_t)output_h; 7963 if ( size == 0 ) 7964 return 0; 7965 7966 *ret_ptr = 0; 7967 *ret_pitch = output_stride_in_bytes; 7968 7969 if ( output_pixels == 0 ) 7970 { 7971 ptr = STBIR_MALLOC( size, 0 ); 7972 if ( ptr == 0 ) 7973 return 0; 7974 7975 *ret_ptr = ptr; 7976 *ret_pitch = pitch; 7977 } 7978 7979 return 1; 7980 } 7981 7982 7983 STBIRDEF unsigned char * stbir_resize_uint8_linear( const unsigned char *input_pixels , int input_w , int input_h, int input_stride_in_bytes, 7984 unsigned char *output_pixels, int output_w, int output_h, int output_stride_in_bytes, 7985 stbir_pixel_layout pixel_layout ) 7986 { 7987 STBIR_RESIZE resize; 7988 unsigned char * optr; 7989 int opitch; 7990 7991 if ( !stbir__check_output_stuff( (void**)&optr, &opitch, output_pixels, sizeof( unsigned char ), output_w, output_h, output_stride_in_bytes, stbir__pixel_layout_convert_public_to_internal[ pixel_layout ] ) ) 7992 return 0; 7993 7994 stbir_resize_init( &resize, 7995 input_pixels, input_w, input_h, input_stride_in_bytes, 7996 (optr) ? optr : output_pixels, output_w, output_h, opitch, 7997 pixel_layout, STBIR_TYPE_UINT8 ); 7998 7999 if ( !stbir_resize_extended( &resize ) ) 8000 { 8001 if ( optr ) 8002 STBIR_FREE( optr, 0 ); 8003 return 0; 8004 } 8005 8006 return (optr) ? optr : output_pixels; 8007 } 8008 8009 STBIRDEF unsigned char * stbir_resize_uint8_srgb( const unsigned char *input_pixels , int input_w , int input_h, int input_stride_in_bytes, 8010 unsigned char *output_pixels, int output_w, int output_h, int output_stride_in_bytes, 8011 stbir_pixel_layout pixel_layout ) 8012 { 8013 STBIR_RESIZE resize; 8014 unsigned char * optr; 8015 int opitch; 8016 8017 if ( !stbir__check_output_stuff( (void**)&optr, &opitch, output_pixels, sizeof( unsigned char ), output_w, output_h, output_stride_in_bytes, stbir__pixel_layout_convert_public_to_internal[ pixel_layout ] ) ) 8018 return 0; 8019 8020 stbir_resize_init( &resize, 8021 input_pixels, input_w, input_h, input_stride_in_bytes, 8022 (optr) ? optr : output_pixels, output_w, output_h, opitch, 8023 pixel_layout, STBIR_TYPE_UINT8_SRGB ); 8024 8025 if ( !stbir_resize_extended( &resize ) ) 8026 { 8027 if ( optr ) 8028 STBIR_FREE( optr, 0 ); 8029 return 0; 8030 } 8031 8032 return (optr) ? optr : output_pixels; 8033 } 8034 8035 8036 STBIRDEF float * stbir_resize_float_linear( const float *input_pixels , int input_w , int input_h, int input_stride_in_bytes, 8037 float *output_pixels, int output_w, int output_h, int output_stride_in_bytes, 8038 stbir_pixel_layout pixel_layout ) 8039 { 8040 STBIR_RESIZE resize; 8041 float * optr; 8042 int opitch; 8043 8044 if ( !stbir__check_output_stuff( (void**)&optr, &opitch, output_pixels, sizeof( float ), output_w, output_h, output_stride_in_bytes, stbir__pixel_layout_convert_public_to_internal[ pixel_layout ] ) ) 8045 return 0; 8046 8047 stbir_resize_init( &resize, 8048 input_pixels, input_w, input_h, input_stride_in_bytes, 8049 (optr) ? optr : output_pixels, output_w, output_h, opitch, 8050 pixel_layout, STBIR_TYPE_FLOAT ); 8051 8052 if ( !stbir_resize_extended( &resize ) ) 8053 { 8054 if ( optr ) 8055 STBIR_FREE( optr, 0 ); 8056 return 0; 8057 } 8058 8059 return (optr) ? optr : output_pixels; 8060 } 8061 8062 8063 STBIRDEF void * stbir_resize( const void *input_pixels , int input_w , int input_h, int input_stride_in_bytes, 8064 void *output_pixels, int output_w, int output_h, int output_stride_in_bytes, 8065 stbir_pixel_layout pixel_layout, stbir_datatype data_type, 8066 stbir_edge edge, stbir_filter filter ) 8067 { 8068 STBIR_RESIZE resize; 8069 float * optr; 8070 int opitch; 8071 8072 if ( !stbir__check_output_stuff( (void**)&optr, &opitch, output_pixels, stbir__type_size[data_type], output_w, output_h, output_stride_in_bytes, stbir__pixel_layout_convert_public_to_internal[ pixel_layout ] ) ) 8073 return 0; 8074 8075 stbir_resize_init( &resize, 8076 input_pixels, input_w, input_h, input_stride_in_bytes, 8077 (optr) ? optr : output_pixels, output_w, output_h, output_stride_in_bytes, 8078 pixel_layout, data_type ); 8079 8080 resize.horizontal_edge = edge; 8081 resize.vertical_edge = edge; 8082 resize.horizontal_filter = filter; 8083 resize.vertical_filter = filter; 8084 8085 if ( !stbir_resize_extended( &resize ) ) 8086 { 8087 if ( optr ) 8088 STBIR_FREE( optr, 0 ); 8089 return 0; 8090 } 8091 8092 return (optr) ? optr : output_pixels; 8093 } 8094 8095 #ifdef STBIR_PROFILE 8096 8097 STBIRDEF void stbir_resize_build_profile_info( STBIR_PROFILE_INFO * info, STBIR_RESIZE const * resize ) 8098 { 8099 static char const * bdescriptions[6] = { "Building", "Allocating", "Horizontal sampler", "Vertical sampler", "Coefficient cleanup", "Coefficient piovot" } ; 8100 stbir__info* samp = resize->samplers; 8101 int i; 8102 8103 typedef int testa[ (STBIR__ARRAY_SIZE( bdescriptions ) == (STBIR__ARRAY_SIZE( samp->profile.array )-1) )?1:-1]; 8104 typedef int testb[ (sizeof( samp->profile.array ) == (sizeof(samp->profile.named)) )?1:-1]; 8105 typedef int testc[ (sizeof( info->clocks ) >= (sizeof(samp->profile.named)) )?1:-1]; 8106 8107 for( i = 0 ; i < STBIR__ARRAY_SIZE( bdescriptions ) ; i++) 8108 info->clocks[i] = samp->profile.array[i+1]; 8109 8110 info->total_clocks = samp->profile.named.total; 8111 info->descriptions = bdescriptions; 8112 info->count = STBIR__ARRAY_SIZE( bdescriptions ); 8113 } 8114 8115 STBIRDEF void stbir_resize_split_profile_info( STBIR_PROFILE_INFO * info, STBIR_RESIZE const * resize, int split_start, int split_count ) 8116 { 8117 static char const * descriptions[7] = { "Looping", "Vertical sampling", "Horizontal sampling", "Scanline input", "Scanline output", "Alpha weighting", "Alpha unweighting" }; 8118 stbir__per_split_info * split_info; 8119 int s, i; 8120 8121 typedef int testa[ (STBIR__ARRAY_SIZE( descriptions ) == (STBIR__ARRAY_SIZE( split_info->profile.array )-1) )?1:-1]; 8122 typedef int testb[ (sizeof( split_info->profile.array ) == (sizeof(split_info->profile.named)) )?1:-1]; 8123 typedef int testc[ (sizeof( info->clocks ) >= (sizeof(split_info->profile.named)) )?1:-1]; 8124 8125 if ( split_start == -1 ) 8126 { 8127 split_start = 0; 8128 split_count = resize->samplers->splits; 8129 } 8130 8131 if ( ( split_start >= resize->splits ) || ( split_start < 0 ) || ( ( split_start + split_count ) > resize->splits ) || ( split_count <= 0 ) ) 8132 { 8133 info->total_clocks = 0; 8134 info->descriptions = 0; 8135 info->count = 0; 8136 return; 8137 } 8138 8139 split_info = resize->samplers->split_info + split_start; 8140 8141 // sum up the profile from all the splits 8142 for( i = 0 ; i < STBIR__ARRAY_SIZE( descriptions ) ; i++ ) 8143 { 8144 stbir_uint64 sum = 0; 8145 for( s = 0 ; s < split_count ; s++ ) 8146 sum += split_info[s].profile.array[i+1]; 8147 info->clocks[i] = sum; 8148 } 8149 8150 info->total_clocks = split_info->profile.named.total; 8151 info->descriptions = descriptions; 8152 info->count = STBIR__ARRAY_SIZE( descriptions ); 8153 } 8154 8155 STBIRDEF void stbir_resize_extended_profile_info( STBIR_PROFILE_INFO * info, STBIR_RESIZE const * resize ) 8156 { 8157 stbir_resize_split_profile_info( info, resize, -1, 0 ); 8158 } 8159 8160 #endif // STBIR_PROFILE 8161 8162 #undef STBIR_BGR 8163 #undef STBIR_1CHANNEL 8164 #undef STBIR_2CHANNEL 8165 #undef STBIR_RGB 8166 #undef STBIR_RGBA 8167 #undef STBIR_4CHANNEL 8168 #undef STBIR_BGRA 8169 #undef STBIR_ARGB 8170 #undef STBIR_ABGR 8171 #undef STBIR_RA 8172 #undef STBIR_AR 8173 #undef STBIR_RGBA_PM 8174 #undef STBIR_BGRA_PM 8175 #undef STBIR_ARGB_PM 8176 #undef STBIR_ABGR_PM 8177 #undef STBIR_RA_PM 8178 #undef STBIR_AR_PM 8179 8180 #endif // STB_IMAGE_RESIZE_IMPLEMENTATION 8181 8182 #else // STB_IMAGE_RESIZE_HORIZONTALS&STB_IMAGE_RESIZE_DO_VERTICALS 8183 8184 // we reinclude the header file to define all the horizontal functions 8185 // specializing each function for the number of coeffs is 20-40% faster *OVERALL* 8186 8187 // by including the header file again this way, we can still debug the functions 8188 8189 #define STBIR_strs_join2( start, mid, end ) start##mid##end 8190 #define STBIR_strs_join1( start, mid, end ) STBIR_strs_join2( start, mid, end ) 8191 8192 #define STBIR_strs_join24( start, mid1, mid2, end ) start##mid1##mid2##end 8193 #define STBIR_strs_join14( start, mid1, mid2, end ) STBIR_strs_join24( start, mid1, mid2, end ) 8194 8195 #ifdef STB_IMAGE_RESIZE_DO_CODERS 8196 8197 #ifdef stbir__decode_suffix 8198 #define STBIR__CODER_NAME( name ) STBIR_strs_join1( name, _, stbir__decode_suffix ) 8199 #else 8200 #define STBIR__CODER_NAME( name ) name 8201 #endif 8202 8203 #ifdef stbir__decode_swizzle 8204 #define stbir__decode_simdf8_flip(reg) STBIR_strs_join1( STBIR_strs_join1( STBIR_strs_join1( STBIR_strs_join1( stbir__simdf8_0123to,stbir__decode_order0,stbir__decode_order1),stbir__decode_order2,stbir__decode_order3),stbir__decode_order0,stbir__decode_order1),stbir__decode_order2,stbir__decode_order3)(reg, reg) 8205 #define stbir__decode_simdf4_flip(reg) STBIR_strs_join1( STBIR_strs_join1( stbir__simdf_0123to,stbir__decode_order0,stbir__decode_order1),stbir__decode_order2,stbir__decode_order3)(reg, reg) 8206 #define stbir__encode_simdf8_unflip(reg) STBIR_strs_join1( STBIR_strs_join1( STBIR_strs_join1( STBIR_strs_join1( stbir__simdf8_0123to,stbir__encode_order0,stbir__encode_order1),stbir__encode_order2,stbir__encode_order3),stbir__encode_order0,stbir__encode_order1),stbir__encode_order2,stbir__encode_order3)(reg, reg) 8207 #define stbir__encode_simdf4_unflip(reg) STBIR_strs_join1( STBIR_strs_join1( stbir__simdf_0123to,stbir__encode_order0,stbir__encode_order1),stbir__encode_order2,stbir__encode_order3)(reg, reg) 8208 #else 8209 #define stbir__decode_order0 0 8210 #define stbir__decode_order1 1 8211 #define stbir__decode_order2 2 8212 #define stbir__decode_order3 3 8213 #define stbir__encode_order0 0 8214 #define stbir__encode_order1 1 8215 #define stbir__encode_order2 2 8216 #define stbir__encode_order3 3 8217 #define stbir__decode_simdf8_flip(reg) 8218 #define stbir__decode_simdf4_flip(reg) 8219 #define stbir__encode_simdf8_unflip(reg) 8220 #define stbir__encode_simdf4_unflip(reg) 8221 #endif 8222 8223 #ifdef STBIR_SIMD8 8224 #define stbir__encode_simdfX_unflip stbir__encode_simdf8_unflip 8225 #else 8226 #define stbir__encode_simdfX_unflip stbir__encode_simdf4_unflip 8227 #endif 8228 8229 static void STBIR__CODER_NAME( stbir__decode_uint8_linear_scaled )( float * decodep, int width_times_channels, void const * inputp ) 8230 { 8231 float STBIR_STREAMOUT_PTR( * ) decode = decodep; 8232 float * decode_end = (float*) decode + width_times_channels; 8233 unsigned char const * input = (unsigned char const*)inputp; 8234 8235 #ifdef STBIR_SIMD 8236 unsigned char const * end_input_m16 = input + width_times_channels - 16; 8237 if ( width_times_channels >= 16 ) 8238 { 8239 decode_end -= 16; 8240 STBIR_NO_UNROLL_LOOP_START_INF_FOR 8241 for(;;) 8242 { 8243 #ifdef STBIR_SIMD8 8244 stbir__simdi i; stbir__simdi8 o0,o1; 8245 stbir__simdf8 of0, of1; 8246 STBIR_NO_UNROLL(decode); 8247 stbir__simdi_load( i, input ); 8248 stbir__simdi8_expand_u8_to_u32( o0, o1, i ); 8249 stbir__simdi8_convert_i32_to_float( of0, o0 ); 8250 stbir__simdi8_convert_i32_to_float( of1, o1 ); 8251 stbir__simdf8_mult( of0, of0, STBIR_max_uint8_as_float_inverted8); 8252 stbir__simdf8_mult( of1, of1, STBIR_max_uint8_as_float_inverted8); 8253 stbir__decode_simdf8_flip( of0 ); 8254 stbir__decode_simdf8_flip( of1 ); 8255 stbir__simdf8_store( decode + 0, of0 ); 8256 stbir__simdf8_store( decode + 8, of1 ); 8257 #else 8258 stbir__simdi i, o0, o1, o2, o3; 8259 stbir__simdf of0, of1, of2, of3; 8260 STBIR_NO_UNROLL(decode); 8261 stbir__simdi_load( i, input ); 8262 stbir__simdi_expand_u8_to_u32( o0,o1,o2,o3,i); 8263 stbir__simdi_convert_i32_to_float( of0, o0 ); 8264 stbir__simdi_convert_i32_to_float( of1, o1 ); 8265 stbir__simdi_convert_i32_to_float( of2, o2 ); 8266 stbir__simdi_convert_i32_to_float( of3, o3 ); 8267 stbir__simdf_mult( of0, of0, STBIR__CONSTF(STBIR_max_uint8_as_float_inverted) ); 8268 stbir__simdf_mult( of1, of1, STBIR__CONSTF(STBIR_max_uint8_as_float_inverted) ); 8269 stbir__simdf_mult( of2, of2, STBIR__CONSTF(STBIR_max_uint8_as_float_inverted) ); 8270 stbir__simdf_mult( of3, of3, STBIR__CONSTF(STBIR_max_uint8_as_float_inverted) ); 8271 stbir__decode_simdf4_flip( of0 ); 8272 stbir__decode_simdf4_flip( of1 ); 8273 stbir__decode_simdf4_flip( of2 ); 8274 stbir__decode_simdf4_flip( of3 ); 8275 stbir__simdf_store( decode + 0, of0 ); 8276 stbir__simdf_store( decode + 4, of1 ); 8277 stbir__simdf_store( decode + 8, of2 ); 8278 stbir__simdf_store( decode + 12, of3 ); 8279 #endif 8280 decode += 16; 8281 input += 16; 8282 if ( decode <= decode_end ) 8283 continue; 8284 if ( decode == ( decode_end + 16 ) ) 8285 break; 8286 decode = decode_end; // backup and do last couple 8287 input = end_input_m16; 8288 } 8289 return; 8290 } 8291 #endif 8292 8293 // try to do blocks of 4 when you can 8294 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four 8295 decode += 4; 8296 STBIR_SIMD_NO_UNROLL_LOOP_START 8297 while( decode <= decode_end ) 8298 { 8299 STBIR_SIMD_NO_UNROLL(decode); 8300 decode[0-4] = ((float)(input[stbir__decode_order0])) * stbir__max_uint8_as_float_inverted; 8301 decode[1-4] = ((float)(input[stbir__decode_order1])) * stbir__max_uint8_as_float_inverted; 8302 decode[2-4] = ((float)(input[stbir__decode_order2])) * stbir__max_uint8_as_float_inverted; 8303 decode[3-4] = ((float)(input[stbir__decode_order3])) * stbir__max_uint8_as_float_inverted; 8304 decode += 4; 8305 input += 4; 8306 } 8307 decode -= 4; 8308 #endif 8309 8310 // do the remnants 8311 #if stbir__coder_min_num < 4 8312 STBIR_NO_UNROLL_LOOP_START 8313 while( decode < decode_end ) 8314 { 8315 STBIR_NO_UNROLL(decode); 8316 decode[0] = ((float)(input[stbir__decode_order0])) * stbir__max_uint8_as_float_inverted; 8317 #if stbir__coder_min_num >= 2 8318 decode[1] = ((float)(input[stbir__decode_order1])) * stbir__max_uint8_as_float_inverted; 8319 #endif 8320 #if stbir__coder_min_num >= 3 8321 decode[2] = ((float)(input[stbir__decode_order2])) * stbir__max_uint8_as_float_inverted; 8322 #endif 8323 decode += stbir__coder_min_num; 8324 input += stbir__coder_min_num; 8325 } 8326 #endif 8327 } 8328 8329 static void STBIR__CODER_NAME( stbir__encode_uint8_linear_scaled )( void * outputp, int width_times_channels, float const * encode ) 8330 { 8331 unsigned char STBIR_SIMD_STREAMOUT_PTR( * ) output = (unsigned char *) outputp; 8332 unsigned char * end_output = ( (unsigned char *) output ) + width_times_channels; 8333 8334 #ifdef STBIR_SIMD 8335 if ( width_times_channels >= stbir__simdfX_float_count*2 ) 8336 { 8337 float const * end_encode_m8 = encode + width_times_channels - stbir__simdfX_float_count*2; 8338 end_output -= stbir__simdfX_float_count*2; 8339 STBIR_NO_UNROLL_LOOP_START_INF_FOR 8340 for(;;) 8341 { 8342 stbir__simdfX e0, e1; 8343 stbir__simdi i; 8344 STBIR_SIMD_NO_UNROLL(encode); 8345 stbir__simdfX_madd_mem( e0, STBIR_simd_point5X, STBIR_max_uint8_as_floatX, encode ); 8346 stbir__simdfX_madd_mem( e1, STBIR_simd_point5X, STBIR_max_uint8_as_floatX, encode+stbir__simdfX_float_count ); 8347 stbir__encode_simdfX_unflip( e0 ); 8348 stbir__encode_simdfX_unflip( e1 ); 8349 #ifdef STBIR_SIMD8 8350 stbir__simdf8_pack_to_16bytes( i, e0, e1 ); 8351 stbir__simdi_store( output, i ); 8352 #else 8353 stbir__simdf_pack_to_8bytes( i, e0, e1 ); 8354 stbir__simdi_store2( output, i ); 8355 #endif 8356 encode += stbir__simdfX_float_count*2; 8357 output += stbir__simdfX_float_count*2; 8358 if ( output <= end_output ) 8359 continue; 8360 if ( output == ( end_output + stbir__simdfX_float_count*2 ) ) 8361 break; 8362 output = end_output; // backup and do last couple 8363 encode = end_encode_m8; 8364 } 8365 return; 8366 } 8367 8368 // try to do blocks of 4 when you can 8369 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four 8370 output += 4; 8371 STBIR_NO_UNROLL_LOOP_START 8372 while( output <= end_output ) 8373 { 8374 stbir__simdf e0; 8375 stbir__simdi i0; 8376 STBIR_NO_UNROLL(encode); 8377 stbir__simdf_load( e0, encode ); 8378 stbir__simdf_madd( e0, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint8_as_float), e0 ); 8379 stbir__encode_simdf4_unflip( e0 ); 8380 stbir__simdf_pack_to_8bytes( i0, e0, e0 ); // only use first 4 8381 *(int*)(output-4) = stbir__simdi_to_int( i0 ); 8382 output += 4; 8383 encode += 4; 8384 } 8385 output -= 4; 8386 #endif 8387 8388 // do the remnants 8389 #if stbir__coder_min_num < 4 8390 STBIR_NO_UNROLL_LOOP_START 8391 while( output < end_output ) 8392 { 8393 stbir__simdf e0; 8394 STBIR_NO_UNROLL(encode); 8395 stbir__simdf_madd1_mem( e0, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint8_as_float), encode+stbir__encode_order0 ); output[0] = stbir__simdf_convert_float_to_uint8( e0 ); 8396 #if stbir__coder_min_num >= 2 8397 stbir__simdf_madd1_mem( e0, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint8_as_float), encode+stbir__encode_order1 ); output[1] = stbir__simdf_convert_float_to_uint8( e0 ); 8398 #endif 8399 #if stbir__coder_min_num >= 3 8400 stbir__simdf_madd1_mem( e0, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint8_as_float), encode+stbir__encode_order2 ); output[2] = stbir__simdf_convert_float_to_uint8( e0 ); 8401 #endif 8402 output += stbir__coder_min_num; 8403 encode += stbir__coder_min_num; 8404 } 8405 #endif 8406 8407 #else 8408 8409 // try to do blocks of 4 when you can 8410 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four 8411 output += 4; 8412 while( output <= end_output ) 8413 { 8414 float f; 8415 f = encode[stbir__encode_order0] * stbir__max_uint8_as_float + 0.5f; STBIR_CLAMP(f, 0, 255); output[0-4] = (unsigned char)f; 8416 f = encode[stbir__encode_order1] * stbir__max_uint8_as_float + 0.5f; STBIR_CLAMP(f, 0, 255); output[1-4] = (unsigned char)f; 8417 f = encode[stbir__encode_order2] * stbir__max_uint8_as_float + 0.5f; STBIR_CLAMP(f, 0, 255); output[2-4] = (unsigned char)f; 8418 f = encode[stbir__encode_order3] * stbir__max_uint8_as_float + 0.5f; STBIR_CLAMP(f, 0, 255); output[3-4] = (unsigned char)f; 8419 output += 4; 8420 encode += 4; 8421 } 8422 output -= 4; 8423 #endif 8424 8425 // do the remnants 8426 #if stbir__coder_min_num < 4 8427 STBIR_NO_UNROLL_LOOP_START 8428 while( output < end_output ) 8429 { 8430 float f; 8431 STBIR_NO_UNROLL(encode); 8432 f = encode[stbir__encode_order0] * stbir__max_uint8_as_float + 0.5f; STBIR_CLAMP(f, 0, 255); output[0] = (unsigned char)f; 8433 #if stbir__coder_min_num >= 2 8434 f = encode[stbir__encode_order1] * stbir__max_uint8_as_float + 0.5f; STBIR_CLAMP(f, 0, 255); output[1] = (unsigned char)f; 8435 #endif 8436 #if stbir__coder_min_num >= 3 8437 f = encode[stbir__encode_order2] * stbir__max_uint8_as_float + 0.5f; STBIR_CLAMP(f, 0, 255); output[2] = (unsigned char)f; 8438 #endif 8439 output += stbir__coder_min_num; 8440 encode += stbir__coder_min_num; 8441 } 8442 #endif 8443 #endif 8444 } 8445 8446 static void STBIR__CODER_NAME(stbir__decode_uint8_linear)( float * decodep, int width_times_channels, void const * inputp ) 8447 { 8448 float STBIR_STREAMOUT_PTR( * ) decode = decodep; 8449 float * decode_end = (float*) decode + width_times_channels; 8450 unsigned char const * input = (unsigned char const*)inputp; 8451 8452 #ifdef STBIR_SIMD 8453 unsigned char const * end_input_m16 = input + width_times_channels - 16; 8454 if ( width_times_channels >= 16 ) 8455 { 8456 decode_end -= 16; 8457 STBIR_NO_UNROLL_LOOP_START_INF_FOR 8458 for(;;) 8459 { 8460 #ifdef STBIR_SIMD8 8461 stbir__simdi i; stbir__simdi8 o0,o1; 8462 stbir__simdf8 of0, of1; 8463 STBIR_NO_UNROLL(decode); 8464 stbir__simdi_load( i, input ); 8465 stbir__simdi8_expand_u8_to_u32( o0, o1, i ); 8466 stbir__simdi8_convert_i32_to_float( of0, o0 ); 8467 stbir__simdi8_convert_i32_to_float( of1, o1 ); 8468 stbir__decode_simdf8_flip( of0 ); 8469 stbir__decode_simdf8_flip( of1 ); 8470 stbir__simdf8_store( decode + 0, of0 ); 8471 stbir__simdf8_store( decode + 8, of1 ); 8472 #else 8473 stbir__simdi i, o0, o1, o2, o3; 8474 stbir__simdf of0, of1, of2, of3; 8475 STBIR_NO_UNROLL(decode); 8476 stbir__simdi_load( i, input ); 8477 stbir__simdi_expand_u8_to_u32( o0,o1,o2,o3,i); 8478 stbir__simdi_convert_i32_to_float( of0, o0 ); 8479 stbir__simdi_convert_i32_to_float( of1, o1 ); 8480 stbir__simdi_convert_i32_to_float( of2, o2 ); 8481 stbir__simdi_convert_i32_to_float( of3, o3 ); 8482 stbir__decode_simdf4_flip( of0 ); 8483 stbir__decode_simdf4_flip( of1 ); 8484 stbir__decode_simdf4_flip( of2 ); 8485 stbir__decode_simdf4_flip( of3 ); 8486 stbir__simdf_store( decode + 0, of0 ); 8487 stbir__simdf_store( decode + 4, of1 ); 8488 stbir__simdf_store( decode + 8, of2 ); 8489 stbir__simdf_store( decode + 12, of3 ); 8490 #endif 8491 decode += 16; 8492 input += 16; 8493 if ( decode <= decode_end ) 8494 continue; 8495 if ( decode == ( decode_end + 16 ) ) 8496 break; 8497 decode = decode_end; // backup and do last couple 8498 input = end_input_m16; 8499 } 8500 return; 8501 } 8502 #endif 8503 8504 // try to do blocks of 4 when you can 8505 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four 8506 decode += 4; 8507 STBIR_SIMD_NO_UNROLL_LOOP_START 8508 while( decode <= decode_end ) 8509 { 8510 STBIR_SIMD_NO_UNROLL(decode); 8511 decode[0-4] = ((float)(input[stbir__decode_order0])); 8512 decode[1-4] = ((float)(input[stbir__decode_order1])); 8513 decode[2-4] = ((float)(input[stbir__decode_order2])); 8514 decode[3-4] = ((float)(input[stbir__decode_order3])); 8515 decode += 4; 8516 input += 4; 8517 } 8518 decode -= 4; 8519 #endif 8520 8521 // do the remnants 8522 #if stbir__coder_min_num < 4 8523 STBIR_NO_UNROLL_LOOP_START 8524 while( decode < decode_end ) 8525 { 8526 STBIR_NO_UNROLL(decode); 8527 decode[0] = ((float)(input[stbir__decode_order0])); 8528 #if stbir__coder_min_num >= 2 8529 decode[1] = ((float)(input[stbir__decode_order1])); 8530 #endif 8531 #if stbir__coder_min_num >= 3 8532 decode[2] = ((float)(input[stbir__decode_order2])); 8533 #endif 8534 decode += stbir__coder_min_num; 8535 input += stbir__coder_min_num; 8536 } 8537 #endif 8538 } 8539 8540 static void STBIR__CODER_NAME( stbir__encode_uint8_linear )( void * outputp, int width_times_channels, float const * encode ) 8541 { 8542 unsigned char STBIR_SIMD_STREAMOUT_PTR( * ) output = (unsigned char *) outputp; 8543 unsigned char * end_output = ( (unsigned char *) output ) + width_times_channels; 8544 8545 #ifdef STBIR_SIMD 8546 if ( width_times_channels >= stbir__simdfX_float_count*2 ) 8547 { 8548 float const * end_encode_m8 = encode + width_times_channels - stbir__simdfX_float_count*2; 8549 end_output -= stbir__simdfX_float_count*2; 8550 STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR 8551 for(;;) 8552 { 8553 stbir__simdfX e0, e1; 8554 stbir__simdi i; 8555 STBIR_SIMD_NO_UNROLL(encode); 8556 stbir__simdfX_add_mem( e0, STBIR_simd_point5X, encode ); 8557 stbir__simdfX_add_mem( e1, STBIR_simd_point5X, encode+stbir__simdfX_float_count ); 8558 stbir__encode_simdfX_unflip( e0 ); 8559 stbir__encode_simdfX_unflip( e1 ); 8560 #ifdef STBIR_SIMD8 8561 stbir__simdf8_pack_to_16bytes( i, e0, e1 ); 8562 stbir__simdi_store( output, i ); 8563 #else 8564 stbir__simdf_pack_to_8bytes( i, e0, e1 ); 8565 stbir__simdi_store2( output, i ); 8566 #endif 8567 encode += stbir__simdfX_float_count*2; 8568 output += stbir__simdfX_float_count*2; 8569 if ( output <= end_output ) 8570 continue; 8571 if ( output == ( end_output + stbir__simdfX_float_count*2 ) ) 8572 break; 8573 output = end_output; // backup and do last couple 8574 encode = end_encode_m8; 8575 } 8576 return; 8577 } 8578 8579 // try to do blocks of 4 when you can 8580 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four 8581 output += 4; 8582 STBIR_NO_UNROLL_LOOP_START 8583 while( output <= end_output ) 8584 { 8585 stbir__simdf e0; 8586 stbir__simdi i0; 8587 STBIR_NO_UNROLL(encode); 8588 stbir__simdf_load( e0, encode ); 8589 stbir__simdf_add( e0, STBIR__CONSTF(STBIR_simd_point5), e0 ); 8590 stbir__encode_simdf4_unflip( e0 ); 8591 stbir__simdf_pack_to_8bytes( i0, e0, e0 ); // only use first 4 8592 *(int*)(output-4) = stbir__simdi_to_int( i0 ); 8593 output += 4; 8594 encode += 4; 8595 } 8596 output -= 4; 8597 #endif 8598 8599 #else 8600 8601 // try to do blocks of 4 when you can 8602 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four 8603 output += 4; 8604 while( output <= end_output ) 8605 { 8606 float f; 8607 f = encode[stbir__encode_order0] + 0.5f; STBIR_CLAMP(f, 0, 255); output[0-4] = (unsigned char)f; 8608 f = encode[stbir__encode_order1] + 0.5f; STBIR_CLAMP(f, 0, 255); output[1-4] = (unsigned char)f; 8609 f = encode[stbir__encode_order2] + 0.5f; STBIR_CLAMP(f, 0, 255); output[2-4] = (unsigned char)f; 8610 f = encode[stbir__encode_order3] + 0.5f; STBIR_CLAMP(f, 0, 255); output[3-4] = (unsigned char)f; 8611 output += 4; 8612 encode += 4; 8613 } 8614 output -= 4; 8615 #endif 8616 8617 #endif 8618 8619 // do the remnants 8620 #if stbir__coder_min_num < 4 8621 STBIR_NO_UNROLL_LOOP_START 8622 while( output < end_output ) 8623 { 8624 float f; 8625 STBIR_NO_UNROLL(encode); 8626 f = encode[stbir__encode_order0] + 0.5f; STBIR_CLAMP(f, 0, 255); output[0] = (unsigned char)f; 8627 #if stbir__coder_min_num >= 2 8628 f = encode[stbir__encode_order1] + 0.5f; STBIR_CLAMP(f, 0, 255); output[1] = (unsigned char)f; 8629 #endif 8630 #if stbir__coder_min_num >= 3 8631 f = encode[stbir__encode_order2] + 0.5f; STBIR_CLAMP(f, 0, 255); output[2] = (unsigned char)f; 8632 #endif 8633 output += stbir__coder_min_num; 8634 encode += stbir__coder_min_num; 8635 } 8636 #endif 8637 } 8638 8639 static void STBIR__CODER_NAME(stbir__decode_uint8_srgb)( float * decodep, int width_times_channels, void const * inputp ) 8640 { 8641 float STBIR_STREAMOUT_PTR( * ) decode = decodep; 8642 float const * decode_end = (float*) decode + width_times_channels; 8643 unsigned char const * input = (unsigned char const *)inputp; 8644 8645 // try to do blocks of 4 when you can 8646 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four 8647 decode += 4; 8648 while( decode <= decode_end ) 8649 { 8650 decode[0-4] = stbir__srgb_uchar_to_linear_float[ input[ stbir__decode_order0 ] ]; 8651 decode[1-4] = stbir__srgb_uchar_to_linear_float[ input[ stbir__decode_order1 ] ]; 8652 decode[2-4] = stbir__srgb_uchar_to_linear_float[ input[ stbir__decode_order2 ] ]; 8653 decode[3-4] = stbir__srgb_uchar_to_linear_float[ input[ stbir__decode_order3 ] ]; 8654 decode += 4; 8655 input += 4; 8656 } 8657 decode -= 4; 8658 #endif 8659 8660 // do the remnants 8661 #if stbir__coder_min_num < 4 8662 STBIR_NO_UNROLL_LOOP_START 8663 while( decode < decode_end ) 8664 { 8665 STBIR_NO_UNROLL(decode); 8666 decode[0] = stbir__srgb_uchar_to_linear_float[ input[ stbir__decode_order0 ] ]; 8667 #if stbir__coder_min_num >= 2 8668 decode[1] = stbir__srgb_uchar_to_linear_float[ input[ stbir__decode_order1 ] ]; 8669 #endif 8670 #if stbir__coder_min_num >= 3 8671 decode[2] = stbir__srgb_uchar_to_linear_float[ input[ stbir__decode_order2 ] ]; 8672 #endif 8673 decode += stbir__coder_min_num; 8674 input += stbir__coder_min_num; 8675 } 8676 #endif 8677 } 8678 8679 #define stbir__min_max_shift20( i, f ) \ 8680 stbir__simdf_max( f, f, stbir_simdf_casti(STBIR__CONSTI( STBIR_almost_zero )) ); \ 8681 stbir__simdf_min( f, f, stbir_simdf_casti(STBIR__CONSTI( STBIR_almost_one )) ); \ 8682 stbir__simdi_32shr( i, stbir_simdi_castf( f ), 20 ); 8683 8684 #define stbir__scale_and_convert( i, f ) \ 8685 stbir__simdf_madd( f, STBIR__CONSTF( STBIR_simd_point5 ), STBIR__CONSTF( STBIR_max_uint8_as_float ), f ); \ 8686 stbir__simdf_max( f, f, stbir__simdf_zeroP() ); \ 8687 stbir__simdf_min( f, f, STBIR__CONSTF( STBIR_max_uint8_as_float ) ); \ 8688 stbir__simdf_convert_float_to_i32( i, f ); 8689 8690 #define stbir__linear_to_srgb_finish( i, f ) \ 8691 { \ 8692 stbir__simdi temp; \ 8693 stbir__simdi_32shr( temp, stbir_simdi_castf( f ), 12 ) ; \ 8694 stbir__simdi_and( temp, temp, STBIR__CONSTI(STBIR_mastissa_mask) ); \ 8695 stbir__simdi_or( temp, temp, STBIR__CONSTI(STBIR_topscale) ); \ 8696 stbir__simdi_16madd( i, i, temp ); \ 8697 stbir__simdi_32shr( i, i, 16 ); \ 8698 } 8699 8700 #define stbir__simdi_table_lookup2( v0,v1, table ) \ 8701 { \ 8702 stbir__simdi_u32 temp0,temp1; \ 8703 temp0.m128i_i128 = v0; \ 8704 temp1.m128i_i128 = v1; \ 8705 temp0.m128i_u32[0] = table[temp0.m128i_i32[0]]; temp0.m128i_u32[1] = table[temp0.m128i_i32[1]]; temp0.m128i_u32[2] = table[temp0.m128i_i32[2]]; temp0.m128i_u32[3] = table[temp0.m128i_i32[3]]; \ 8706 temp1.m128i_u32[0] = table[temp1.m128i_i32[0]]; temp1.m128i_u32[1] = table[temp1.m128i_i32[1]]; temp1.m128i_u32[2] = table[temp1.m128i_i32[2]]; temp1.m128i_u32[3] = table[temp1.m128i_i32[3]]; \ 8707 v0 = temp0.m128i_i128; \ 8708 v1 = temp1.m128i_i128; \ 8709 } 8710 8711 #define stbir__simdi_table_lookup3( v0,v1,v2, table ) \ 8712 { \ 8713 stbir__simdi_u32 temp0,temp1,temp2; \ 8714 temp0.m128i_i128 = v0; \ 8715 temp1.m128i_i128 = v1; \ 8716 temp2.m128i_i128 = v2; \ 8717 temp0.m128i_u32[0] = table[temp0.m128i_i32[0]]; temp0.m128i_u32[1] = table[temp0.m128i_i32[1]]; temp0.m128i_u32[2] = table[temp0.m128i_i32[2]]; temp0.m128i_u32[3] = table[temp0.m128i_i32[3]]; \ 8718 temp1.m128i_u32[0] = table[temp1.m128i_i32[0]]; temp1.m128i_u32[1] = table[temp1.m128i_i32[1]]; temp1.m128i_u32[2] = table[temp1.m128i_i32[2]]; temp1.m128i_u32[3] = table[temp1.m128i_i32[3]]; \ 8719 temp2.m128i_u32[0] = table[temp2.m128i_i32[0]]; temp2.m128i_u32[1] = table[temp2.m128i_i32[1]]; temp2.m128i_u32[2] = table[temp2.m128i_i32[2]]; temp2.m128i_u32[3] = table[temp2.m128i_i32[3]]; \ 8720 v0 = temp0.m128i_i128; \ 8721 v1 = temp1.m128i_i128; \ 8722 v2 = temp2.m128i_i128; \ 8723 } 8724 8725 #define stbir__simdi_table_lookup4( v0,v1,v2,v3, table ) \ 8726 { \ 8727 stbir__simdi_u32 temp0,temp1,temp2,temp3; \ 8728 temp0.m128i_i128 = v0; \ 8729 temp1.m128i_i128 = v1; \ 8730 temp2.m128i_i128 = v2; \ 8731 temp3.m128i_i128 = v3; \ 8732 temp0.m128i_u32[0] = table[temp0.m128i_i32[0]]; temp0.m128i_u32[1] = table[temp0.m128i_i32[1]]; temp0.m128i_u32[2] = table[temp0.m128i_i32[2]]; temp0.m128i_u32[3] = table[temp0.m128i_i32[3]]; \ 8733 temp1.m128i_u32[0] = table[temp1.m128i_i32[0]]; temp1.m128i_u32[1] = table[temp1.m128i_i32[1]]; temp1.m128i_u32[2] = table[temp1.m128i_i32[2]]; temp1.m128i_u32[3] = table[temp1.m128i_i32[3]]; \ 8734 temp2.m128i_u32[0] = table[temp2.m128i_i32[0]]; temp2.m128i_u32[1] = table[temp2.m128i_i32[1]]; temp2.m128i_u32[2] = table[temp2.m128i_i32[2]]; temp2.m128i_u32[3] = table[temp2.m128i_i32[3]]; \ 8735 temp3.m128i_u32[0] = table[temp3.m128i_i32[0]]; temp3.m128i_u32[1] = table[temp3.m128i_i32[1]]; temp3.m128i_u32[2] = table[temp3.m128i_i32[2]]; temp3.m128i_u32[3] = table[temp3.m128i_i32[3]]; \ 8736 v0 = temp0.m128i_i128; \ 8737 v1 = temp1.m128i_i128; \ 8738 v2 = temp2.m128i_i128; \ 8739 v3 = temp3.m128i_i128; \ 8740 } 8741 8742 static void STBIR__CODER_NAME( stbir__encode_uint8_srgb )( void * outputp, int width_times_channels, float const * encode ) 8743 { 8744 unsigned char STBIR_SIMD_STREAMOUT_PTR( * ) output = (unsigned char*) outputp; 8745 unsigned char * end_output = ( (unsigned char*) output ) + width_times_channels; 8746 8747 #ifdef STBIR_SIMD 8748 8749 if ( width_times_channels >= 16 ) 8750 { 8751 float const * end_encode_m16 = encode + width_times_channels - 16; 8752 end_output -= 16; 8753 STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR 8754 for(;;) 8755 { 8756 stbir__simdf f0, f1, f2, f3; 8757 stbir__simdi i0, i1, i2, i3; 8758 STBIR_SIMD_NO_UNROLL(encode); 8759 8760 stbir__simdf_load4_transposed( f0, f1, f2, f3, encode ); 8761 8762 stbir__min_max_shift20( i0, f0 ); 8763 stbir__min_max_shift20( i1, f1 ); 8764 stbir__min_max_shift20( i2, f2 ); 8765 stbir__min_max_shift20( i3, f3 ); 8766 8767 stbir__simdi_table_lookup4( i0, i1, i2, i3, ( fp32_to_srgb8_tab4 - (127-13)*8 ) ); 8768 8769 stbir__linear_to_srgb_finish( i0, f0 ); 8770 stbir__linear_to_srgb_finish( i1, f1 ); 8771 stbir__linear_to_srgb_finish( i2, f2 ); 8772 stbir__linear_to_srgb_finish( i3, f3 ); 8773 8774 stbir__interleave_pack_and_store_16_u8( output, STBIR_strs_join1(i, ,stbir__encode_order0), STBIR_strs_join1(i, ,stbir__encode_order1), STBIR_strs_join1(i, ,stbir__encode_order2), STBIR_strs_join1(i, ,stbir__encode_order3) ); 8775 8776 encode += 16; 8777 output += 16; 8778 if ( output <= end_output ) 8779 continue; 8780 if ( output == ( end_output + 16 ) ) 8781 break; 8782 output = end_output; // backup and do last couple 8783 encode = end_encode_m16; 8784 } 8785 return; 8786 } 8787 #endif 8788 8789 // try to do blocks of 4 when you can 8790 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four 8791 output += 4; 8792 STBIR_SIMD_NO_UNROLL_LOOP_START 8793 while ( output <= end_output ) 8794 { 8795 STBIR_SIMD_NO_UNROLL(encode); 8796 8797 output[0-4] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order0] ); 8798 output[1-4] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order1] ); 8799 output[2-4] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order2] ); 8800 output[3-4] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order3] ); 8801 8802 output += 4; 8803 encode += 4; 8804 } 8805 output -= 4; 8806 #endif 8807 8808 // do the remnants 8809 #if stbir__coder_min_num < 4 8810 STBIR_NO_UNROLL_LOOP_START 8811 while( output < end_output ) 8812 { 8813 STBIR_NO_UNROLL(encode); 8814 output[0] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order0] ); 8815 #if stbir__coder_min_num >= 2 8816 output[1] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order1] ); 8817 #endif 8818 #if stbir__coder_min_num >= 3 8819 output[2] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order2] ); 8820 #endif 8821 output += stbir__coder_min_num; 8822 encode += stbir__coder_min_num; 8823 } 8824 #endif 8825 } 8826 8827 #if ( stbir__coder_min_num == 4 ) || ( ( stbir__coder_min_num == 1 ) && ( !defined(stbir__decode_swizzle) ) ) 8828 8829 static void STBIR__CODER_NAME(stbir__decode_uint8_srgb4_linearalpha)( float * decodep, int width_times_channels, void const * inputp ) 8830 { 8831 float STBIR_STREAMOUT_PTR( * ) decode = decodep; 8832 float const * decode_end = (float*) decode + width_times_channels; 8833 unsigned char const * input = (unsigned char const *)inputp; 8834 do { 8835 decode[0] = stbir__srgb_uchar_to_linear_float[ input[stbir__decode_order0] ]; 8836 decode[1] = stbir__srgb_uchar_to_linear_float[ input[stbir__decode_order1] ]; 8837 decode[2] = stbir__srgb_uchar_to_linear_float[ input[stbir__decode_order2] ]; 8838 decode[3] = ( (float) input[stbir__decode_order3] ) * stbir__max_uint8_as_float_inverted; 8839 input += 4; 8840 decode += 4; 8841 } while( decode < decode_end ); 8842 } 8843 8844 8845 static void STBIR__CODER_NAME( stbir__encode_uint8_srgb4_linearalpha )( void * outputp, int width_times_channels, float const * encode ) 8846 { 8847 unsigned char STBIR_SIMD_STREAMOUT_PTR( * ) output = (unsigned char*) outputp; 8848 unsigned char * end_output = ( (unsigned char*) output ) + width_times_channels; 8849 8850 #ifdef STBIR_SIMD 8851 8852 if ( width_times_channels >= 16 ) 8853 { 8854 float const * end_encode_m16 = encode + width_times_channels - 16; 8855 end_output -= 16; 8856 STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR 8857 for(;;) 8858 { 8859 stbir__simdf f0, f1, f2, f3; 8860 stbir__simdi i0, i1, i2, i3; 8861 8862 STBIR_SIMD_NO_UNROLL(encode); 8863 stbir__simdf_load4_transposed( f0, f1, f2, f3, encode ); 8864 8865 stbir__min_max_shift20( i0, f0 ); 8866 stbir__min_max_shift20( i1, f1 ); 8867 stbir__min_max_shift20( i2, f2 ); 8868 stbir__scale_and_convert( i3, f3 ); 8869 8870 stbir__simdi_table_lookup3( i0, i1, i2, ( fp32_to_srgb8_tab4 - (127-13)*8 ) ); 8871 8872 stbir__linear_to_srgb_finish( i0, f0 ); 8873 stbir__linear_to_srgb_finish( i1, f1 ); 8874 stbir__linear_to_srgb_finish( i2, f2 ); 8875 8876 stbir__interleave_pack_and_store_16_u8( output, STBIR_strs_join1(i, ,stbir__encode_order0), STBIR_strs_join1(i, ,stbir__encode_order1), STBIR_strs_join1(i, ,stbir__encode_order2), STBIR_strs_join1(i, ,stbir__encode_order3) ); 8877 8878 output += 16; 8879 encode += 16; 8880 8881 if ( output <= end_output ) 8882 continue; 8883 if ( output == ( end_output + 16 ) ) 8884 break; 8885 output = end_output; // backup and do last couple 8886 encode = end_encode_m16; 8887 } 8888 return; 8889 } 8890 #endif 8891 8892 STBIR_SIMD_NO_UNROLL_LOOP_START 8893 do { 8894 float f; 8895 STBIR_SIMD_NO_UNROLL(encode); 8896 8897 output[stbir__decode_order0] = stbir__linear_to_srgb_uchar( encode[0] ); 8898 output[stbir__decode_order1] = stbir__linear_to_srgb_uchar( encode[1] ); 8899 output[stbir__decode_order2] = stbir__linear_to_srgb_uchar( encode[2] ); 8900 8901 f = encode[3] * stbir__max_uint8_as_float + 0.5f; 8902 STBIR_CLAMP(f, 0, 255); 8903 output[stbir__decode_order3] = (unsigned char) f; 8904 8905 output += 4; 8906 encode += 4; 8907 } while( output < end_output ); 8908 } 8909 8910 #endif 8911 8912 #if ( stbir__coder_min_num == 2 ) || ( ( stbir__coder_min_num == 1 ) && ( !defined(stbir__decode_swizzle) ) ) 8913 8914 static void STBIR__CODER_NAME(stbir__decode_uint8_srgb2_linearalpha)( float * decodep, int width_times_channels, void const * inputp ) 8915 { 8916 float STBIR_STREAMOUT_PTR( * ) decode = decodep; 8917 float const * decode_end = (float*) decode + width_times_channels; 8918 unsigned char const * input = (unsigned char const *)inputp; 8919 decode += 4; 8920 while( decode <= decode_end ) 8921 { 8922 decode[0-4] = stbir__srgb_uchar_to_linear_float[ input[stbir__decode_order0] ]; 8923 decode[1-4] = ( (float) input[stbir__decode_order1] ) * stbir__max_uint8_as_float_inverted; 8924 decode[2-4] = stbir__srgb_uchar_to_linear_float[ input[stbir__decode_order0+2] ]; 8925 decode[3-4] = ( (float) input[stbir__decode_order1+2] ) * stbir__max_uint8_as_float_inverted; 8926 input += 4; 8927 decode += 4; 8928 } 8929 decode -= 4; 8930 if( decode < decode_end ) 8931 { 8932 decode[0] = stbir__srgb_uchar_to_linear_float[ stbir__decode_order0 ]; 8933 decode[1] = ( (float) input[stbir__decode_order1] ) * stbir__max_uint8_as_float_inverted; 8934 } 8935 } 8936 8937 static void STBIR__CODER_NAME( stbir__encode_uint8_srgb2_linearalpha )( void * outputp, int width_times_channels, float const * encode ) 8938 { 8939 unsigned char STBIR_SIMD_STREAMOUT_PTR( * ) output = (unsigned char*) outputp; 8940 unsigned char * end_output = ( (unsigned char*) output ) + width_times_channels; 8941 8942 #ifdef STBIR_SIMD 8943 8944 if ( width_times_channels >= 16 ) 8945 { 8946 float const * end_encode_m16 = encode + width_times_channels - 16; 8947 end_output -= 16; 8948 STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR 8949 for(;;) 8950 { 8951 stbir__simdf f0, f1, f2, f3; 8952 stbir__simdi i0, i1, i2, i3; 8953 8954 STBIR_SIMD_NO_UNROLL(encode); 8955 stbir__simdf_load4_transposed( f0, f1, f2, f3, encode ); 8956 8957 stbir__min_max_shift20( i0, f0 ); 8958 stbir__scale_and_convert( i1, f1 ); 8959 stbir__min_max_shift20( i2, f2 ); 8960 stbir__scale_and_convert( i3, f3 ); 8961 8962 stbir__simdi_table_lookup2( i0, i2, ( fp32_to_srgb8_tab4 - (127-13)*8 ) ); 8963 8964 stbir__linear_to_srgb_finish( i0, f0 ); 8965 stbir__linear_to_srgb_finish( i2, f2 ); 8966 8967 stbir__interleave_pack_and_store_16_u8( output, STBIR_strs_join1(i, ,stbir__encode_order0), STBIR_strs_join1(i, ,stbir__encode_order1), STBIR_strs_join1(i, ,stbir__encode_order2), STBIR_strs_join1(i, ,stbir__encode_order3) ); 8968 8969 output += 16; 8970 encode += 16; 8971 if ( output <= end_output ) 8972 continue; 8973 if ( output == ( end_output + 16 ) ) 8974 break; 8975 output = end_output; // backup and do last couple 8976 encode = end_encode_m16; 8977 } 8978 return; 8979 } 8980 #endif 8981 8982 STBIR_SIMD_NO_UNROLL_LOOP_START 8983 do { 8984 float f; 8985 STBIR_SIMD_NO_UNROLL(encode); 8986 8987 output[stbir__decode_order0] = stbir__linear_to_srgb_uchar( encode[0] ); 8988 8989 f = encode[1] * stbir__max_uint8_as_float + 0.5f; 8990 STBIR_CLAMP(f, 0, 255); 8991 output[stbir__decode_order1] = (unsigned char) f; 8992 8993 output += 2; 8994 encode += 2; 8995 } while( output < end_output ); 8996 } 8997 8998 #endif 8999 9000 static void STBIR__CODER_NAME(stbir__decode_uint16_linear_scaled)( float * decodep, int width_times_channels, void const * inputp ) 9001 { 9002 float STBIR_STREAMOUT_PTR( * ) decode = decodep; 9003 float * decode_end = (float*) decode + width_times_channels; 9004 unsigned short const * input = (unsigned short const *)inputp; 9005 9006 #ifdef STBIR_SIMD 9007 unsigned short const * end_input_m8 = input + width_times_channels - 8; 9008 if ( width_times_channels >= 8 ) 9009 { 9010 decode_end -= 8; 9011 STBIR_NO_UNROLL_LOOP_START_INF_FOR 9012 for(;;) 9013 { 9014 #ifdef STBIR_SIMD8 9015 stbir__simdi i; stbir__simdi8 o; 9016 stbir__simdf8 of; 9017 STBIR_NO_UNROLL(decode); 9018 stbir__simdi_load( i, input ); 9019 stbir__simdi8_expand_u16_to_u32( o, i ); 9020 stbir__simdi8_convert_i32_to_float( of, o ); 9021 stbir__simdf8_mult( of, of, STBIR_max_uint16_as_float_inverted8); 9022 stbir__decode_simdf8_flip( of ); 9023 stbir__simdf8_store( decode + 0, of ); 9024 #else 9025 stbir__simdi i, o0, o1; 9026 stbir__simdf of0, of1; 9027 STBIR_NO_UNROLL(decode); 9028 stbir__simdi_load( i, input ); 9029 stbir__simdi_expand_u16_to_u32( o0,o1,i ); 9030 stbir__simdi_convert_i32_to_float( of0, o0 ); 9031 stbir__simdi_convert_i32_to_float( of1, o1 ); 9032 stbir__simdf_mult( of0, of0, STBIR__CONSTF(STBIR_max_uint16_as_float_inverted) ); 9033 stbir__simdf_mult( of1, of1, STBIR__CONSTF(STBIR_max_uint16_as_float_inverted)); 9034 stbir__decode_simdf4_flip( of0 ); 9035 stbir__decode_simdf4_flip( of1 ); 9036 stbir__simdf_store( decode + 0, of0 ); 9037 stbir__simdf_store( decode + 4, of1 ); 9038 #endif 9039 decode += 8; 9040 input += 8; 9041 if ( decode <= decode_end ) 9042 continue; 9043 if ( decode == ( decode_end + 8 ) ) 9044 break; 9045 decode = decode_end; // backup and do last couple 9046 input = end_input_m8; 9047 } 9048 return; 9049 } 9050 #endif 9051 9052 // try to do blocks of 4 when you can 9053 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four 9054 decode += 4; 9055 STBIR_SIMD_NO_UNROLL_LOOP_START 9056 while( decode <= decode_end ) 9057 { 9058 STBIR_SIMD_NO_UNROLL(decode); 9059 decode[0-4] = ((float)(input[stbir__decode_order0])) * stbir__max_uint16_as_float_inverted; 9060 decode[1-4] = ((float)(input[stbir__decode_order1])) * stbir__max_uint16_as_float_inverted; 9061 decode[2-4] = ((float)(input[stbir__decode_order2])) * stbir__max_uint16_as_float_inverted; 9062 decode[3-4] = ((float)(input[stbir__decode_order3])) * stbir__max_uint16_as_float_inverted; 9063 decode += 4; 9064 input += 4; 9065 } 9066 decode -= 4; 9067 #endif 9068 9069 // do the remnants 9070 #if stbir__coder_min_num < 4 9071 STBIR_NO_UNROLL_LOOP_START 9072 while( decode < decode_end ) 9073 { 9074 STBIR_NO_UNROLL(decode); 9075 decode[0] = ((float)(input[stbir__decode_order0])) * stbir__max_uint16_as_float_inverted; 9076 #if stbir__coder_min_num >= 2 9077 decode[1] = ((float)(input[stbir__decode_order1])) * stbir__max_uint16_as_float_inverted; 9078 #endif 9079 #if stbir__coder_min_num >= 3 9080 decode[2] = ((float)(input[stbir__decode_order2])) * stbir__max_uint16_as_float_inverted; 9081 #endif 9082 decode += stbir__coder_min_num; 9083 input += stbir__coder_min_num; 9084 } 9085 #endif 9086 } 9087 9088 9089 static void STBIR__CODER_NAME(stbir__encode_uint16_linear_scaled)( void * outputp, int width_times_channels, float const * encode ) 9090 { 9091 unsigned short STBIR_SIMD_STREAMOUT_PTR( * ) output = (unsigned short*) outputp; 9092 unsigned short * end_output = ( (unsigned short*) output ) + width_times_channels; 9093 9094 #ifdef STBIR_SIMD 9095 { 9096 if ( width_times_channels >= stbir__simdfX_float_count*2 ) 9097 { 9098 float const * end_encode_m8 = encode + width_times_channels - stbir__simdfX_float_count*2; 9099 end_output -= stbir__simdfX_float_count*2; 9100 STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR 9101 for(;;) 9102 { 9103 stbir__simdfX e0, e1; 9104 stbir__simdiX i; 9105 STBIR_SIMD_NO_UNROLL(encode); 9106 stbir__simdfX_madd_mem( e0, STBIR_simd_point5X, STBIR_max_uint16_as_floatX, encode ); 9107 stbir__simdfX_madd_mem( e1, STBIR_simd_point5X, STBIR_max_uint16_as_floatX, encode+stbir__simdfX_float_count ); 9108 stbir__encode_simdfX_unflip( e0 ); 9109 stbir__encode_simdfX_unflip( e1 ); 9110 stbir__simdfX_pack_to_words( i, e0, e1 ); 9111 stbir__simdiX_store( output, i ); 9112 encode += stbir__simdfX_float_count*2; 9113 output += stbir__simdfX_float_count*2; 9114 if ( output <= end_output ) 9115 continue; 9116 if ( output == ( end_output + stbir__simdfX_float_count*2 ) ) 9117 break; 9118 output = end_output; // backup and do last couple 9119 encode = end_encode_m8; 9120 } 9121 return; 9122 } 9123 } 9124 9125 // try to do blocks of 4 when you can 9126 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four 9127 output += 4; 9128 STBIR_NO_UNROLL_LOOP_START 9129 while( output <= end_output ) 9130 { 9131 stbir__simdf e; 9132 stbir__simdi i; 9133 STBIR_NO_UNROLL(encode); 9134 stbir__simdf_load( e, encode ); 9135 stbir__simdf_madd( e, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint16_as_float), e ); 9136 stbir__encode_simdf4_unflip( e ); 9137 stbir__simdf_pack_to_8words( i, e, e ); // only use first 4 9138 stbir__simdi_store2( output-4, i ); 9139 output += 4; 9140 encode += 4; 9141 } 9142 output -= 4; 9143 #endif 9144 9145 // do the remnants 9146 #if stbir__coder_min_num < 4 9147 STBIR_NO_UNROLL_LOOP_START 9148 while( output < end_output ) 9149 { 9150 stbir__simdf e; 9151 STBIR_NO_UNROLL(encode); 9152 stbir__simdf_madd1_mem( e, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint16_as_float), encode+stbir__encode_order0 ); output[0] = stbir__simdf_convert_float_to_short( e ); 9153 #if stbir__coder_min_num >= 2 9154 stbir__simdf_madd1_mem( e, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint16_as_float), encode+stbir__encode_order1 ); output[1] = stbir__simdf_convert_float_to_short( e ); 9155 #endif 9156 #if stbir__coder_min_num >= 3 9157 stbir__simdf_madd1_mem( e, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint16_as_float), encode+stbir__encode_order2 ); output[2] = stbir__simdf_convert_float_to_short( e ); 9158 #endif 9159 output += stbir__coder_min_num; 9160 encode += stbir__coder_min_num; 9161 } 9162 #endif 9163 9164 #else 9165 9166 // try to do blocks of 4 when you can 9167 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four 9168 output += 4; 9169 STBIR_SIMD_NO_UNROLL_LOOP_START 9170 while( output <= end_output ) 9171 { 9172 float f; 9173 STBIR_SIMD_NO_UNROLL(encode); 9174 f = encode[stbir__encode_order0] * stbir__max_uint16_as_float + 0.5f; STBIR_CLAMP(f, 0, 65535); output[0-4] = (unsigned short)f; 9175 f = encode[stbir__encode_order1] * stbir__max_uint16_as_float + 0.5f; STBIR_CLAMP(f, 0, 65535); output[1-4] = (unsigned short)f; 9176 f = encode[stbir__encode_order2] * stbir__max_uint16_as_float + 0.5f; STBIR_CLAMP(f, 0, 65535); output[2-4] = (unsigned short)f; 9177 f = encode[stbir__encode_order3] * stbir__max_uint16_as_float + 0.5f; STBIR_CLAMP(f, 0, 65535); output[3-4] = (unsigned short)f; 9178 output += 4; 9179 encode += 4; 9180 } 9181 output -= 4; 9182 #endif 9183 9184 // do the remnants 9185 #if stbir__coder_min_num < 4 9186 STBIR_NO_UNROLL_LOOP_START 9187 while( output < end_output ) 9188 { 9189 float f; 9190 STBIR_NO_UNROLL(encode); 9191 f = encode[stbir__encode_order0] * stbir__max_uint16_as_float + 0.5f; STBIR_CLAMP(f, 0, 65535); output[0] = (unsigned short)f; 9192 #if stbir__coder_min_num >= 2 9193 f = encode[stbir__encode_order1] * stbir__max_uint16_as_float + 0.5f; STBIR_CLAMP(f, 0, 65535); output[1] = (unsigned short)f; 9194 #endif 9195 #if stbir__coder_min_num >= 3 9196 f = encode[stbir__encode_order2] * stbir__max_uint16_as_float + 0.5f; STBIR_CLAMP(f, 0, 65535); output[2] = (unsigned short)f; 9197 #endif 9198 output += stbir__coder_min_num; 9199 encode += stbir__coder_min_num; 9200 } 9201 #endif 9202 #endif 9203 } 9204 9205 static void STBIR__CODER_NAME(stbir__decode_uint16_linear)( float * decodep, int width_times_channels, void const * inputp ) 9206 { 9207 float STBIR_STREAMOUT_PTR( * ) decode = decodep; 9208 float * decode_end = (float*) decode + width_times_channels; 9209 unsigned short const * input = (unsigned short const *)inputp; 9210 9211 #ifdef STBIR_SIMD 9212 unsigned short const * end_input_m8 = input + width_times_channels - 8; 9213 if ( width_times_channels >= 8 ) 9214 { 9215 decode_end -= 8; 9216 STBIR_NO_UNROLL_LOOP_START_INF_FOR 9217 for(;;) 9218 { 9219 #ifdef STBIR_SIMD8 9220 stbir__simdi i; stbir__simdi8 o; 9221 stbir__simdf8 of; 9222 STBIR_NO_UNROLL(decode); 9223 stbir__simdi_load( i, input ); 9224 stbir__simdi8_expand_u16_to_u32( o, i ); 9225 stbir__simdi8_convert_i32_to_float( of, o ); 9226 stbir__decode_simdf8_flip( of ); 9227 stbir__simdf8_store( decode + 0, of ); 9228 #else 9229 stbir__simdi i, o0, o1; 9230 stbir__simdf of0, of1; 9231 STBIR_NO_UNROLL(decode); 9232 stbir__simdi_load( i, input ); 9233 stbir__simdi_expand_u16_to_u32( o0, o1, i ); 9234 stbir__simdi_convert_i32_to_float( of0, o0 ); 9235 stbir__simdi_convert_i32_to_float( of1, o1 ); 9236 stbir__decode_simdf4_flip( of0 ); 9237 stbir__decode_simdf4_flip( of1 ); 9238 stbir__simdf_store( decode + 0, of0 ); 9239 stbir__simdf_store( decode + 4, of1 ); 9240 #endif 9241 decode += 8; 9242 input += 8; 9243 if ( decode <= decode_end ) 9244 continue; 9245 if ( decode == ( decode_end + 8 ) ) 9246 break; 9247 decode = decode_end; // backup and do last couple 9248 input = end_input_m8; 9249 } 9250 return; 9251 } 9252 #endif 9253 9254 // try to do blocks of 4 when you can 9255 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four 9256 decode += 4; 9257 STBIR_SIMD_NO_UNROLL_LOOP_START 9258 while( decode <= decode_end ) 9259 { 9260 STBIR_SIMD_NO_UNROLL(decode); 9261 decode[0-4] = ((float)(input[stbir__decode_order0])); 9262 decode[1-4] = ((float)(input[stbir__decode_order1])); 9263 decode[2-4] = ((float)(input[stbir__decode_order2])); 9264 decode[3-4] = ((float)(input[stbir__decode_order3])); 9265 decode += 4; 9266 input += 4; 9267 } 9268 decode -= 4; 9269 #endif 9270 9271 // do the remnants 9272 #if stbir__coder_min_num < 4 9273 STBIR_NO_UNROLL_LOOP_START 9274 while( decode < decode_end ) 9275 { 9276 STBIR_NO_UNROLL(decode); 9277 decode[0] = ((float)(input[stbir__decode_order0])); 9278 #if stbir__coder_min_num >= 2 9279 decode[1] = ((float)(input[stbir__decode_order1])); 9280 #endif 9281 #if stbir__coder_min_num >= 3 9282 decode[2] = ((float)(input[stbir__decode_order2])); 9283 #endif 9284 decode += stbir__coder_min_num; 9285 input += stbir__coder_min_num; 9286 } 9287 #endif 9288 } 9289 9290 static void STBIR__CODER_NAME(stbir__encode_uint16_linear)( void * outputp, int width_times_channels, float const * encode ) 9291 { 9292 unsigned short STBIR_SIMD_STREAMOUT_PTR( * ) output = (unsigned short*) outputp; 9293 unsigned short * end_output = ( (unsigned short*) output ) + width_times_channels; 9294 9295 #ifdef STBIR_SIMD 9296 { 9297 if ( width_times_channels >= stbir__simdfX_float_count*2 ) 9298 { 9299 float const * end_encode_m8 = encode + width_times_channels - stbir__simdfX_float_count*2; 9300 end_output -= stbir__simdfX_float_count*2; 9301 STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR 9302 for(;;) 9303 { 9304 stbir__simdfX e0, e1; 9305 stbir__simdiX i; 9306 STBIR_SIMD_NO_UNROLL(encode); 9307 stbir__simdfX_add_mem( e0, STBIR_simd_point5X, encode ); 9308 stbir__simdfX_add_mem( e1, STBIR_simd_point5X, encode+stbir__simdfX_float_count ); 9309 stbir__encode_simdfX_unflip( e0 ); 9310 stbir__encode_simdfX_unflip( e1 ); 9311 stbir__simdfX_pack_to_words( i, e0, e1 ); 9312 stbir__simdiX_store( output, i ); 9313 encode += stbir__simdfX_float_count*2; 9314 output += stbir__simdfX_float_count*2; 9315 if ( output <= end_output ) 9316 continue; 9317 if ( output == ( end_output + stbir__simdfX_float_count*2 ) ) 9318 break; 9319 output = end_output; // backup and do last couple 9320 encode = end_encode_m8; 9321 } 9322 return; 9323 } 9324 } 9325 9326 // try to do blocks of 4 when you can 9327 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four 9328 output += 4; 9329 STBIR_NO_UNROLL_LOOP_START 9330 while( output <= end_output ) 9331 { 9332 stbir__simdf e; 9333 stbir__simdi i; 9334 STBIR_NO_UNROLL(encode); 9335 stbir__simdf_load( e, encode ); 9336 stbir__simdf_add( e, STBIR__CONSTF(STBIR_simd_point5), e ); 9337 stbir__encode_simdf4_unflip( e ); 9338 stbir__simdf_pack_to_8words( i, e, e ); // only use first 4 9339 stbir__simdi_store2( output-4, i ); 9340 output += 4; 9341 encode += 4; 9342 } 9343 output -= 4; 9344 #endif 9345 9346 #else 9347 9348 // try to do blocks of 4 when you can 9349 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four 9350 output += 4; 9351 STBIR_SIMD_NO_UNROLL_LOOP_START 9352 while( output <= end_output ) 9353 { 9354 float f; 9355 STBIR_SIMD_NO_UNROLL(encode); 9356 f = encode[stbir__encode_order0] + 0.5f; STBIR_CLAMP(f, 0, 65535); output[0-4] = (unsigned short)f; 9357 f = encode[stbir__encode_order1] + 0.5f; STBIR_CLAMP(f, 0, 65535); output[1-4] = (unsigned short)f; 9358 f = encode[stbir__encode_order2] + 0.5f; STBIR_CLAMP(f, 0, 65535); output[2-4] = (unsigned short)f; 9359 f = encode[stbir__encode_order3] + 0.5f; STBIR_CLAMP(f, 0, 65535); output[3-4] = (unsigned short)f; 9360 output += 4; 9361 encode += 4; 9362 } 9363 output -= 4; 9364 #endif 9365 9366 #endif 9367 9368 // do the remnants 9369 #if stbir__coder_min_num < 4 9370 STBIR_NO_UNROLL_LOOP_START 9371 while( output < end_output ) 9372 { 9373 float f; 9374 STBIR_NO_UNROLL(encode); 9375 f = encode[stbir__encode_order0] + 0.5f; STBIR_CLAMP(f, 0, 65535); output[0] = (unsigned short)f; 9376 #if stbir__coder_min_num >= 2 9377 f = encode[stbir__encode_order1] + 0.5f; STBIR_CLAMP(f, 0, 65535); output[1] = (unsigned short)f; 9378 #endif 9379 #if stbir__coder_min_num >= 3 9380 f = encode[stbir__encode_order2] + 0.5f; STBIR_CLAMP(f, 0, 65535); output[2] = (unsigned short)f; 9381 #endif 9382 output += stbir__coder_min_num; 9383 encode += stbir__coder_min_num; 9384 } 9385 #endif 9386 } 9387 9388 static void STBIR__CODER_NAME(stbir__decode_half_float_linear)( float * decodep, int width_times_channels, void const * inputp ) 9389 { 9390 float STBIR_STREAMOUT_PTR( * ) decode = decodep; 9391 float * decode_end = (float*) decode + width_times_channels; 9392 stbir__FP16 const * input = (stbir__FP16 const *)inputp; 9393 9394 #ifdef STBIR_SIMD 9395 if ( width_times_channels >= 8 ) 9396 { 9397 stbir__FP16 const * end_input_m8 = input + width_times_channels - 8; 9398 decode_end -= 8; 9399 STBIR_NO_UNROLL_LOOP_START_INF_FOR 9400 for(;;) 9401 { 9402 STBIR_NO_UNROLL(decode); 9403 9404 stbir__half_to_float_SIMD( decode, input ); 9405 #ifdef stbir__decode_swizzle 9406 #ifdef STBIR_SIMD8 9407 { 9408 stbir__simdf8 of; 9409 stbir__simdf8_load( of, decode ); 9410 stbir__decode_simdf8_flip( of ); 9411 stbir__simdf8_store( decode, of ); 9412 } 9413 #else 9414 { 9415 stbir__simdf of0,of1; 9416 stbir__simdf_load( of0, decode ); 9417 stbir__simdf_load( of1, decode+4 ); 9418 stbir__decode_simdf4_flip( of0 ); 9419 stbir__decode_simdf4_flip( of1 ); 9420 stbir__simdf_store( decode, of0 ); 9421 stbir__simdf_store( decode+4, of1 ); 9422 } 9423 #endif 9424 #endif 9425 decode += 8; 9426 input += 8; 9427 if ( decode <= decode_end ) 9428 continue; 9429 if ( decode == ( decode_end + 8 ) ) 9430 break; 9431 decode = decode_end; // backup and do last couple 9432 input = end_input_m8; 9433 } 9434 return; 9435 } 9436 #endif 9437 9438 // try to do blocks of 4 when you can 9439 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four 9440 decode += 4; 9441 STBIR_SIMD_NO_UNROLL_LOOP_START 9442 while( decode <= decode_end ) 9443 { 9444 STBIR_SIMD_NO_UNROLL(decode); 9445 decode[0-4] = stbir__half_to_float(input[stbir__decode_order0]); 9446 decode[1-4] = stbir__half_to_float(input[stbir__decode_order1]); 9447 decode[2-4] = stbir__half_to_float(input[stbir__decode_order2]); 9448 decode[3-4] = stbir__half_to_float(input[stbir__decode_order3]); 9449 decode += 4; 9450 input += 4; 9451 } 9452 decode -= 4; 9453 #endif 9454 9455 // do the remnants 9456 #if stbir__coder_min_num < 4 9457 STBIR_NO_UNROLL_LOOP_START 9458 while( decode < decode_end ) 9459 { 9460 STBIR_NO_UNROLL(decode); 9461 decode[0] = stbir__half_to_float(input[stbir__decode_order0]); 9462 #if stbir__coder_min_num >= 2 9463 decode[1] = stbir__half_to_float(input[stbir__decode_order1]); 9464 #endif 9465 #if stbir__coder_min_num >= 3 9466 decode[2] = stbir__half_to_float(input[stbir__decode_order2]); 9467 #endif 9468 decode += stbir__coder_min_num; 9469 input += stbir__coder_min_num; 9470 } 9471 #endif 9472 } 9473 9474 static void STBIR__CODER_NAME( stbir__encode_half_float_linear )( void * outputp, int width_times_channels, float const * encode ) 9475 { 9476 stbir__FP16 STBIR_SIMD_STREAMOUT_PTR( * ) output = (stbir__FP16*) outputp; 9477 stbir__FP16 * end_output = ( (stbir__FP16*) output ) + width_times_channels; 9478 9479 #ifdef STBIR_SIMD 9480 if ( width_times_channels >= 8 ) 9481 { 9482 float const * end_encode_m8 = encode + width_times_channels - 8; 9483 end_output -= 8; 9484 STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR 9485 for(;;) 9486 { 9487 STBIR_SIMD_NO_UNROLL(encode); 9488 #ifdef stbir__decode_swizzle 9489 #ifdef STBIR_SIMD8 9490 { 9491 stbir__simdf8 of; 9492 stbir__simdf8_load( of, encode ); 9493 stbir__encode_simdf8_unflip( of ); 9494 stbir__float_to_half_SIMD( output, (float*)&of ); 9495 } 9496 #else 9497 { 9498 stbir__simdf of[2]; 9499 stbir__simdf_load( of[0], encode ); 9500 stbir__simdf_load( of[1], encode+4 ); 9501 stbir__encode_simdf4_unflip( of[0] ); 9502 stbir__encode_simdf4_unflip( of[1] ); 9503 stbir__float_to_half_SIMD( output, (float*)of ); 9504 } 9505 #endif 9506 #else 9507 stbir__float_to_half_SIMD( output, encode ); 9508 #endif 9509 encode += 8; 9510 output += 8; 9511 if ( output <= end_output ) 9512 continue; 9513 if ( output == ( end_output + 8 ) ) 9514 break; 9515 output = end_output; // backup and do last couple 9516 encode = end_encode_m8; 9517 } 9518 return; 9519 } 9520 #endif 9521 9522 // try to do blocks of 4 when you can 9523 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four 9524 output += 4; 9525 STBIR_SIMD_NO_UNROLL_LOOP_START 9526 while( output <= end_output ) 9527 { 9528 STBIR_SIMD_NO_UNROLL(output); 9529 output[0-4] = stbir__float_to_half(encode[stbir__encode_order0]); 9530 output[1-4] = stbir__float_to_half(encode[stbir__encode_order1]); 9531 output[2-4] = stbir__float_to_half(encode[stbir__encode_order2]); 9532 output[3-4] = stbir__float_to_half(encode[stbir__encode_order3]); 9533 output += 4; 9534 encode += 4; 9535 } 9536 output -= 4; 9537 #endif 9538 9539 // do the remnants 9540 #if stbir__coder_min_num < 4 9541 STBIR_NO_UNROLL_LOOP_START 9542 while( output < end_output ) 9543 { 9544 STBIR_NO_UNROLL(output); 9545 output[0] = stbir__float_to_half(encode[stbir__encode_order0]); 9546 #if stbir__coder_min_num >= 2 9547 output[1] = stbir__float_to_half(encode[stbir__encode_order1]); 9548 #endif 9549 #if stbir__coder_min_num >= 3 9550 output[2] = stbir__float_to_half(encode[stbir__encode_order2]); 9551 #endif 9552 output += stbir__coder_min_num; 9553 encode += stbir__coder_min_num; 9554 } 9555 #endif 9556 } 9557 9558 static void STBIR__CODER_NAME(stbir__decode_float_linear)( float * decodep, int width_times_channels, void const * inputp ) 9559 { 9560 #ifdef stbir__decode_swizzle 9561 float STBIR_STREAMOUT_PTR( * ) decode = decodep; 9562 float * decode_end = (float*) decode + width_times_channels; 9563 float const * input = (float const *)inputp; 9564 9565 #ifdef STBIR_SIMD 9566 if ( width_times_channels >= 16 ) 9567 { 9568 float const * end_input_m16 = input + width_times_channels - 16; 9569 decode_end -= 16; 9570 STBIR_NO_UNROLL_LOOP_START_INF_FOR 9571 for(;;) 9572 { 9573 STBIR_NO_UNROLL(decode); 9574 #ifdef stbir__decode_swizzle 9575 #ifdef STBIR_SIMD8 9576 { 9577 stbir__simdf8 of0,of1; 9578 stbir__simdf8_load( of0, input ); 9579 stbir__simdf8_load( of1, input+8 ); 9580 stbir__decode_simdf8_flip( of0 ); 9581 stbir__decode_simdf8_flip( of1 ); 9582 stbir__simdf8_store( decode, of0 ); 9583 stbir__simdf8_store( decode+8, of1 ); 9584 } 9585 #else 9586 { 9587 stbir__simdf of0,of1,of2,of3; 9588 stbir__simdf_load( of0, input ); 9589 stbir__simdf_load( of1, input+4 ); 9590 stbir__simdf_load( of2, input+8 ); 9591 stbir__simdf_load( of3, input+12 ); 9592 stbir__decode_simdf4_flip( of0 ); 9593 stbir__decode_simdf4_flip( of1 ); 9594 stbir__decode_simdf4_flip( of2 ); 9595 stbir__decode_simdf4_flip( of3 ); 9596 stbir__simdf_store( decode, of0 ); 9597 stbir__simdf_store( decode+4, of1 ); 9598 stbir__simdf_store( decode+8, of2 ); 9599 stbir__simdf_store( decode+12, of3 ); 9600 } 9601 #endif 9602 #endif 9603 decode += 16; 9604 input += 16; 9605 if ( decode <= decode_end ) 9606 continue; 9607 if ( decode == ( decode_end + 16 ) ) 9608 break; 9609 decode = decode_end; // backup and do last couple 9610 input = end_input_m16; 9611 } 9612 return; 9613 } 9614 #endif 9615 9616 // try to do blocks of 4 when you can 9617 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four 9618 decode += 4; 9619 STBIR_SIMD_NO_UNROLL_LOOP_START 9620 while( decode <= decode_end ) 9621 { 9622 STBIR_SIMD_NO_UNROLL(decode); 9623 decode[0-4] = input[stbir__decode_order0]; 9624 decode[1-4] = input[stbir__decode_order1]; 9625 decode[2-4] = input[stbir__decode_order2]; 9626 decode[3-4] = input[stbir__decode_order3]; 9627 decode += 4; 9628 input += 4; 9629 } 9630 decode -= 4; 9631 #endif 9632 9633 // do the remnants 9634 #if stbir__coder_min_num < 4 9635 STBIR_NO_UNROLL_LOOP_START 9636 while( decode < decode_end ) 9637 { 9638 STBIR_NO_UNROLL(decode); 9639 decode[0] = input[stbir__decode_order0]; 9640 #if stbir__coder_min_num >= 2 9641 decode[1] = input[stbir__decode_order1]; 9642 #endif 9643 #if stbir__coder_min_num >= 3 9644 decode[2] = input[stbir__decode_order2]; 9645 #endif 9646 decode += stbir__coder_min_num; 9647 input += stbir__coder_min_num; 9648 } 9649 #endif 9650 9651 #else 9652 9653 if ( (void*)decodep != inputp ) 9654 STBIR_MEMCPY( decodep, inputp, width_times_channels * sizeof( float ) ); 9655 9656 #endif 9657 } 9658 9659 static void STBIR__CODER_NAME( stbir__encode_float_linear )( void * outputp, int width_times_channels, float const * encode ) 9660 { 9661 #if !defined( STBIR_FLOAT_HIGH_CLAMP ) && !defined(STBIR_FLOAT_LO_CLAMP) && !defined(stbir__decode_swizzle) 9662 9663 if ( (void*)outputp != (void*) encode ) 9664 STBIR_MEMCPY( outputp, encode, width_times_channels * sizeof( float ) ); 9665 9666 #else 9667 9668 float STBIR_SIMD_STREAMOUT_PTR( * ) output = (float*) outputp; 9669 float * end_output = ( (float*) output ) + width_times_channels; 9670 9671 #ifdef STBIR_FLOAT_HIGH_CLAMP 9672 #define stbir_scalar_hi_clamp( v ) if ( v > STBIR_FLOAT_HIGH_CLAMP ) v = STBIR_FLOAT_HIGH_CLAMP; 9673 #else 9674 #define stbir_scalar_hi_clamp( v ) 9675 #endif 9676 #ifdef STBIR_FLOAT_LOW_CLAMP 9677 #define stbir_scalar_lo_clamp( v ) if ( v < STBIR_FLOAT_LOW_CLAMP ) v = STBIR_FLOAT_LOW_CLAMP; 9678 #else 9679 #define stbir_scalar_lo_clamp( v ) 9680 #endif 9681 9682 #ifdef STBIR_SIMD 9683 9684 #ifdef STBIR_FLOAT_HIGH_CLAMP 9685 const stbir__simdfX high_clamp = stbir__simdf_frepX(STBIR_FLOAT_HIGH_CLAMP); 9686 #endif 9687 #ifdef STBIR_FLOAT_LOW_CLAMP 9688 const stbir__simdfX low_clamp = stbir__simdf_frepX(STBIR_FLOAT_LOW_CLAMP); 9689 #endif 9690 9691 if ( width_times_channels >= ( stbir__simdfX_float_count * 2 ) ) 9692 { 9693 float const * end_encode_m8 = encode + width_times_channels - ( stbir__simdfX_float_count * 2 ); 9694 end_output -= ( stbir__simdfX_float_count * 2 ); 9695 STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR 9696 for(;;) 9697 { 9698 stbir__simdfX e0, e1; 9699 STBIR_SIMD_NO_UNROLL(encode); 9700 stbir__simdfX_load( e0, encode ); 9701 stbir__simdfX_load( e1, encode+stbir__simdfX_float_count ); 9702 #ifdef STBIR_FLOAT_HIGH_CLAMP 9703 stbir__simdfX_min( e0, e0, high_clamp ); 9704 stbir__simdfX_min( e1, e1, high_clamp ); 9705 #endif 9706 #ifdef STBIR_FLOAT_LOW_CLAMP 9707 stbir__simdfX_max( e0, e0, low_clamp ); 9708 stbir__simdfX_max( e1, e1, low_clamp ); 9709 #endif 9710 stbir__encode_simdfX_unflip( e0 ); 9711 stbir__encode_simdfX_unflip( e1 ); 9712 stbir__simdfX_store( output, e0 ); 9713 stbir__simdfX_store( output+stbir__simdfX_float_count, e1 ); 9714 encode += stbir__simdfX_float_count * 2; 9715 output += stbir__simdfX_float_count * 2; 9716 if ( output < end_output ) 9717 continue; 9718 if ( output == ( end_output + ( stbir__simdfX_float_count * 2 ) ) ) 9719 break; 9720 output = end_output; // backup and do last couple 9721 encode = end_encode_m8; 9722 } 9723 return; 9724 } 9725 9726 // try to do blocks of 4 when you can 9727 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four 9728 output += 4; 9729 STBIR_NO_UNROLL_LOOP_START 9730 while( output <= end_output ) 9731 { 9732 stbir__simdf e0; 9733 STBIR_NO_UNROLL(encode); 9734 stbir__simdf_load( e0, encode ); 9735 #ifdef STBIR_FLOAT_HIGH_CLAMP 9736 stbir__simdf_min( e0, e0, high_clamp ); 9737 #endif 9738 #ifdef STBIR_FLOAT_LOW_CLAMP 9739 stbir__simdf_max( e0, e0, low_clamp ); 9740 #endif 9741 stbir__encode_simdf4_unflip( e0 ); 9742 stbir__simdf_store( output-4, e0 ); 9743 output += 4; 9744 encode += 4; 9745 } 9746 output -= 4; 9747 #endif 9748 9749 #else 9750 9751 // try to do blocks of 4 when you can 9752 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four 9753 output += 4; 9754 STBIR_SIMD_NO_UNROLL_LOOP_START 9755 while( output <= end_output ) 9756 { 9757 float e; 9758 STBIR_SIMD_NO_UNROLL(encode); 9759 e = encode[ stbir__encode_order0 ]; stbir_scalar_hi_clamp( e ); stbir_scalar_lo_clamp( e ); output[0-4] = e; 9760 e = encode[ stbir__encode_order1 ]; stbir_scalar_hi_clamp( e ); stbir_scalar_lo_clamp( e ); output[1-4] = e; 9761 e = encode[ stbir__encode_order2 ]; stbir_scalar_hi_clamp( e ); stbir_scalar_lo_clamp( e ); output[2-4] = e; 9762 e = encode[ stbir__encode_order3 ]; stbir_scalar_hi_clamp( e ); stbir_scalar_lo_clamp( e ); output[3-4] = e; 9763 output += 4; 9764 encode += 4; 9765 } 9766 output -= 4; 9767 9768 #endif 9769 9770 #endif 9771 9772 // do the remnants 9773 #if stbir__coder_min_num < 4 9774 STBIR_NO_UNROLL_LOOP_START 9775 while( output < end_output ) 9776 { 9777 float e; 9778 STBIR_NO_UNROLL(encode); 9779 e = encode[ stbir__encode_order0 ]; stbir_scalar_hi_clamp( e ); stbir_scalar_lo_clamp( e ); output[0] = e; 9780 #if stbir__coder_min_num >= 2 9781 e = encode[ stbir__encode_order1 ]; stbir_scalar_hi_clamp( e ); stbir_scalar_lo_clamp( e ); output[1] = e; 9782 #endif 9783 #if stbir__coder_min_num >= 3 9784 e = encode[ stbir__encode_order2 ]; stbir_scalar_hi_clamp( e ); stbir_scalar_lo_clamp( e ); output[2] = e; 9785 #endif 9786 output += stbir__coder_min_num; 9787 encode += stbir__coder_min_num; 9788 } 9789 #endif 9790 9791 #endif 9792 } 9793 9794 #undef stbir__decode_suffix 9795 #undef stbir__decode_simdf8_flip 9796 #undef stbir__decode_simdf4_flip 9797 #undef stbir__decode_order0 9798 #undef stbir__decode_order1 9799 #undef stbir__decode_order2 9800 #undef stbir__decode_order3 9801 #undef stbir__encode_order0 9802 #undef stbir__encode_order1 9803 #undef stbir__encode_order2 9804 #undef stbir__encode_order3 9805 #undef stbir__encode_simdf8_unflip 9806 #undef stbir__encode_simdf4_unflip 9807 #undef stbir__encode_simdfX_unflip 9808 #undef STBIR__CODER_NAME 9809 #undef stbir__coder_min_num 9810 #undef stbir__decode_swizzle 9811 #undef stbir_scalar_hi_clamp 9812 #undef stbir_scalar_lo_clamp 9813 #undef STB_IMAGE_RESIZE_DO_CODERS 9814 9815 #elif defined( STB_IMAGE_RESIZE_DO_VERTICALS) 9816 9817 #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE 9818 #define STBIR_chans( start, end ) STBIR_strs_join14(start,STBIR__vertical_channels,end,_cont) 9819 #else 9820 #define STBIR_chans( start, end ) STBIR_strs_join1(start,STBIR__vertical_channels,end) 9821 #endif 9822 9823 #if STBIR__vertical_channels >= 1 9824 #define stbIF0( code ) code 9825 #else 9826 #define stbIF0( code ) 9827 #endif 9828 #if STBIR__vertical_channels >= 2 9829 #define stbIF1( code ) code 9830 #else 9831 #define stbIF1( code ) 9832 #endif 9833 #if STBIR__vertical_channels >= 3 9834 #define stbIF2( code ) code 9835 #else 9836 #define stbIF2( code ) 9837 #endif 9838 #if STBIR__vertical_channels >= 4 9839 #define stbIF3( code ) code 9840 #else 9841 #define stbIF3( code ) 9842 #endif 9843 #if STBIR__vertical_channels >= 5 9844 #define stbIF4( code ) code 9845 #else 9846 #define stbIF4( code ) 9847 #endif 9848 #if STBIR__vertical_channels >= 6 9849 #define stbIF5( code ) code 9850 #else 9851 #define stbIF5( code ) 9852 #endif 9853 #if STBIR__vertical_channels >= 7 9854 #define stbIF6( code ) code 9855 #else 9856 #define stbIF6( code ) 9857 #endif 9858 #if STBIR__vertical_channels >= 8 9859 #define stbIF7( code ) code 9860 #else 9861 #define stbIF7( code ) 9862 #endif 9863 9864 static void STBIR_chans( stbir__vertical_scatter_with_,_coeffs)( float ** outputs, float const * vertical_coefficients, float const * input, float const * input_end ) 9865 { 9866 stbIF0( float STBIR_SIMD_STREAMOUT_PTR( * ) output0 = outputs[0]; float c0s = vertical_coefficients[0]; ) 9867 stbIF1( float STBIR_SIMD_STREAMOUT_PTR( * ) output1 = outputs[1]; float c1s = vertical_coefficients[1]; ) 9868 stbIF2( float STBIR_SIMD_STREAMOUT_PTR( * ) output2 = outputs[2]; float c2s = vertical_coefficients[2]; ) 9869 stbIF3( float STBIR_SIMD_STREAMOUT_PTR( * ) output3 = outputs[3]; float c3s = vertical_coefficients[3]; ) 9870 stbIF4( float STBIR_SIMD_STREAMOUT_PTR( * ) output4 = outputs[4]; float c4s = vertical_coefficients[4]; ) 9871 stbIF5( float STBIR_SIMD_STREAMOUT_PTR( * ) output5 = outputs[5]; float c5s = vertical_coefficients[5]; ) 9872 stbIF6( float STBIR_SIMD_STREAMOUT_PTR( * ) output6 = outputs[6]; float c6s = vertical_coefficients[6]; ) 9873 stbIF7( float STBIR_SIMD_STREAMOUT_PTR( * ) output7 = outputs[7]; float c7s = vertical_coefficients[7]; ) 9874 9875 #ifdef STBIR_SIMD 9876 { 9877 stbIF0(stbir__simdfX c0 = stbir__simdf_frepX( c0s ); ) 9878 stbIF1(stbir__simdfX c1 = stbir__simdf_frepX( c1s ); ) 9879 stbIF2(stbir__simdfX c2 = stbir__simdf_frepX( c2s ); ) 9880 stbIF3(stbir__simdfX c3 = stbir__simdf_frepX( c3s ); ) 9881 stbIF4(stbir__simdfX c4 = stbir__simdf_frepX( c4s ); ) 9882 stbIF5(stbir__simdfX c5 = stbir__simdf_frepX( c5s ); ) 9883 stbIF6(stbir__simdfX c6 = stbir__simdf_frepX( c6s ); ) 9884 stbIF7(stbir__simdfX c7 = stbir__simdf_frepX( c7s ); ) 9885 STBIR_SIMD_NO_UNROLL_LOOP_START 9886 while ( ( (char*)input_end - (char*) input ) >= (16*stbir__simdfX_float_count) ) 9887 { 9888 stbir__simdfX o0, o1, o2, o3, r0, r1, r2, r3; 9889 STBIR_SIMD_NO_UNROLL(output0); 9890 9891 stbir__simdfX_load( r0, input ); stbir__simdfX_load( r1, input+stbir__simdfX_float_count ); stbir__simdfX_load( r2, input+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( r3, input+(3*stbir__simdfX_float_count) ); 9892 9893 #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE 9894 stbIF0( stbir__simdfX_load( o0, output0 ); stbir__simdfX_load( o1, output0+stbir__simdfX_float_count ); stbir__simdfX_load( o2, output0+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( o3, output0+(3*stbir__simdfX_float_count) ); 9895 stbir__simdfX_madd( o0, o0, r0, c0 ); stbir__simdfX_madd( o1, o1, r1, c0 ); stbir__simdfX_madd( o2, o2, r2, c0 ); stbir__simdfX_madd( o3, o3, r3, c0 ); 9896 stbir__simdfX_store( output0, o0 ); stbir__simdfX_store( output0+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output0+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output0+(3*stbir__simdfX_float_count), o3 ); ) 9897 stbIF1( stbir__simdfX_load( o0, output1 ); stbir__simdfX_load( o1, output1+stbir__simdfX_float_count ); stbir__simdfX_load( o2, output1+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( o3, output1+(3*stbir__simdfX_float_count) ); 9898 stbir__simdfX_madd( o0, o0, r0, c1 ); stbir__simdfX_madd( o1, o1, r1, c1 ); stbir__simdfX_madd( o2, o2, r2, c1 ); stbir__simdfX_madd( o3, o3, r3, c1 ); 9899 stbir__simdfX_store( output1, o0 ); stbir__simdfX_store( output1+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output1+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output1+(3*stbir__simdfX_float_count), o3 ); ) 9900 stbIF2( stbir__simdfX_load( o0, output2 ); stbir__simdfX_load( o1, output2+stbir__simdfX_float_count ); stbir__simdfX_load( o2, output2+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( o3, output2+(3*stbir__simdfX_float_count) ); 9901 stbir__simdfX_madd( o0, o0, r0, c2 ); stbir__simdfX_madd( o1, o1, r1, c2 ); stbir__simdfX_madd( o2, o2, r2, c2 ); stbir__simdfX_madd( o3, o3, r3, c2 ); 9902 stbir__simdfX_store( output2, o0 ); stbir__simdfX_store( output2+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output2+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output2+(3*stbir__simdfX_float_count), o3 ); ) 9903 stbIF3( stbir__simdfX_load( o0, output3 ); stbir__simdfX_load( o1, output3+stbir__simdfX_float_count ); stbir__simdfX_load( o2, output3+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( o3, output3+(3*stbir__simdfX_float_count) ); 9904 stbir__simdfX_madd( o0, o0, r0, c3 ); stbir__simdfX_madd( o1, o1, r1, c3 ); stbir__simdfX_madd( o2, o2, r2, c3 ); stbir__simdfX_madd( o3, o3, r3, c3 ); 9905 stbir__simdfX_store( output3, o0 ); stbir__simdfX_store( output3+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output3+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output3+(3*stbir__simdfX_float_count), o3 ); ) 9906 stbIF4( stbir__simdfX_load( o0, output4 ); stbir__simdfX_load( o1, output4+stbir__simdfX_float_count ); stbir__simdfX_load( o2, output4+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( o3, output4+(3*stbir__simdfX_float_count) ); 9907 stbir__simdfX_madd( o0, o0, r0, c4 ); stbir__simdfX_madd( o1, o1, r1, c4 ); stbir__simdfX_madd( o2, o2, r2, c4 ); stbir__simdfX_madd( o3, o3, r3, c4 ); 9908 stbir__simdfX_store( output4, o0 ); stbir__simdfX_store( output4+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output4+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output4+(3*stbir__simdfX_float_count), o3 ); ) 9909 stbIF5( stbir__simdfX_load( o0, output5 ); stbir__simdfX_load( o1, output5+stbir__simdfX_float_count ); stbir__simdfX_load( o2, output5+(2*stbir__simdfX_float_count)); stbir__simdfX_load( o3, output5+(3*stbir__simdfX_float_count) ); 9910 stbir__simdfX_madd( o0, o0, r0, c5 ); stbir__simdfX_madd( o1, o1, r1, c5 ); stbir__simdfX_madd( o2, o2, r2, c5 ); stbir__simdfX_madd( o3, o3, r3, c5 ); 9911 stbir__simdfX_store( output5, o0 ); stbir__simdfX_store( output5+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output5+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output5+(3*stbir__simdfX_float_count), o3 ); ) 9912 stbIF6( stbir__simdfX_load( o0, output6 ); stbir__simdfX_load( o1, output6+stbir__simdfX_float_count ); stbir__simdfX_load( o2, output6+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( o3, output6+(3*stbir__simdfX_float_count) ); 9913 stbir__simdfX_madd( o0, o0, r0, c6 ); stbir__simdfX_madd( o1, o1, r1, c6 ); stbir__simdfX_madd( o2, o2, r2, c6 ); stbir__simdfX_madd( o3, o3, r3, c6 ); 9914 stbir__simdfX_store( output6, o0 ); stbir__simdfX_store( output6+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output6+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output6+(3*stbir__simdfX_float_count), o3 ); ) 9915 stbIF7( stbir__simdfX_load( o0, output7 ); stbir__simdfX_load( o1, output7+stbir__simdfX_float_count ); stbir__simdfX_load( o2, output7+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( o3, output7+(3*stbir__simdfX_float_count) ); 9916 stbir__simdfX_madd( o0, o0, r0, c7 ); stbir__simdfX_madd( o1, o1, r1, c7 ); stbir__simdfX_madd( o2, o2, r2, c7 ); stbir__simdfX_madd( o3, o3, r3, c7 ); 9917 stbir__simdfX_store( output7, o0 ); stbir__simdfX_store( output7+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output7+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output7+(3*stbir__simdfX_float_count), o3 ); ) 9918 #else 9919 stbIF0( stbir__simdfX_mult( o0, r0, c0 ); stbir__simdfX_mult( o1, r1, c0 ); stbir__simdfX_mult( o2, r2, c0 ); stbir__simdfX_mult( o3, r3, c0 ); 9920 stbir__simdfX_store( output0, o0 ); stbir__simdfX_store( output0+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output0+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output0+(3*stbir__simdfX_float_count), o3 ); ) 9921 stbIF1( stbir__simdfX_mult( o0, r0, c1 ); stbir__simdfX_mult( o1, r1, c1 ); stbir__simdfX_mult( o2, r2, c1 ); stbir__simdfX_mult( o3, r3, c1 ); 9922 stbir__simdfX_store( output1, o0 ); stbir__simdfX_store( output1+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output1+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output1+(3*stbir__simdfX_float_count), o3 ); ) 9923 stbIF2( stbir__simdfX_mult( o0, r0, c2 ); stbir__simdfX_mult( o1, r1, c2 ); stbir__simdfX_mult( o2, r2, c2 ); stbir__simdfX_mult( o3, r3, c2 ); 9924 stbir__simdfX_store( output2, o0 ); stbir__simdfX_store( output2+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output2+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output2+(3*stbir__simdfX_float_count), o3 ); ) 9925 stbIF3( stbir__simdfX_mult( o0, r0, c3 ); stbir__simdfX_mult( o1, r1, c3 ); stbir__simdfX_mult( o2, r2, c3 ); stbir__simdfX_mult( o3, r3, c3 ); 9926 stbir__simdfX_store( output3, o0 ); stbir__simdfX_store( output3+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output3+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output3+(3*stbir__simdfX_float_count), o3 ); ) 9927 stbIF4( stbir__simdfX_mult( o0, r0, c4 ); stbir__simdfX_mult( o1, r1, c4 ); stbir__simdfX_mult( o2, r2, c4 ); stbir__simdfX_mult( o3, r3, c4 ); 9928 stbir__simdfX_store( output4, o0 ); stbir__simdfX_store( output4+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output4+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output4+(3*stbir__simdfX_float_count), o3 ); ) 9929 stbIF5( stbir__simdfX_mult( o0, r0, c5 ); stbir__simdfX_mult( o1, r1, c5 ); stbir__simdfX_mult( o2, r2, c5 ); stbir__simdfX_mult( o3, r3, c5 ); 9930 stbir__simdfX_store( output5, o0 ); stbir__simdfX_store( output5+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output5+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output5+(3*stbir__simdfX_float_count), o3 ); ) 9931 stbIF6( stbir__simdfX_mult( o0, r0, c6 ); stbir__simdfX_mult( o1, r1, c6 ); stbir__simdfX_mult( o2, r2, c6 ); stbir__simdfX_mult( o3, r3, c6 ); 9932 stbir__simdfX_store( output6, o0 ); stbir__simdfX_store( output6+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output6+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output6+(3*stbir__simdfX_float_count), o3 ); ) 9933 stbIF7( stbir__simdfX_mult( o0, r0, c7 ); stbir__simdfX_mult( o1, r1, c7 ); stbir__simdfX_mult( o2, r2, c7 ); stbir__simdfX_mult( o3, r3, c7 ); 9934 stbir__simdfX_store( output7, o0 ); stbir__simdfX_store( output7+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output7+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output7+(3*stbir__simdfX_float_count), o3 ); ) 9935 #endif 9936 9937 input += (4*stbir__simdfX_float_count); 9938 stbIF0( output0 += (4*stbir__simdfX_float_count); ) stbIF1( output1 += (4*stbir__simdfX_float_count); ) stbIF2( output2 += (4*stbir__simdfX_float_count); ) stbIF3( output3 += (4*stbir__simdfX_float_count); ) stbIF4( output4 += (4*stbir__simdfX_float_count); ) stbIF5( output5 += (4*stbir__simdfX_float_count); ) stbIF6( output6 += (4*stbir__simdfX_float_count); ) stbIF7( output7 += (4*stbir__simdfX_float_count); ) 9939 } 9940 STBIR_SIMD_NO_UNROLL_LOOP_START 9941 while ( ( (char*)input_end - (char*) input ) >= 16 ) 9942 { 9943 stbir__simdf o0, r0; 9944 STBIR_SIMD_NO_UNROLL(output0); 9945 9946 stbir__simdf_load( r0, input ); 9947 9948 #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE 9949 stbIF0( stbir__simdf_load( o0, output0 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c0 ) ); stbir__simdf_store( output0, o0 ); ) 9950 stbIF1( stbir__simdf_load( o0, output1 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c1 ) ); stbir__simdf_store( output1, o0 ); ) 9951 stbIF2( stbir__simdf_load( o0, output2 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c2 ) ); stbir__simdf_store( output2, o0 ); ) 9952 stbIF3( stbir__simdf_load( o0, output3 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c3 ) ); stbir__simdf_store( output3, o0 ); ) 9953 stbIF4( stbir__simdf_load( o0, output4 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c4 ) ); stbir__simdf_store( output4, o0 ); ) 9954 stbIF5( stbir__simdf_load( o0, output5 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c5 ) ); stbir__simdf_store( output5, o0 ); ) 9955 stbIF6( stbir__simdf_load( o0, output6 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c6 ) ); stbir__simdf_store( output6, o0 ); ) 9956 stbIF7( stbir__simdf_load( o0, output7 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c7 ) ); stbir__simdf_store( output7, o0 ); ) 9957 #else 9958 stbIF0( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c0 ) ); stbir__simdf_store( output0, o0 ); ) 9959 stbIF1( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c1 ) ); stbir__simdf_store( output1, o0 ); ) 9960 stbIF2( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c2 ) ); stbir__simdf_store( output2, o0 ); ) 9961 stbIF3( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c3 ) ); stbir__simdf_store( output3, o0 ); ) 9962 stbIF4( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c4 ) ); stbir__simdf_store( output4, o0 ); ) 9963 stbIF5( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c5 ) ); stbir__simdf_store( output5, o0 ); ) 9964 stbIF6( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c6 ) ); stbir__simdf_store( output6, o0 ); ) 9965 stbIF7( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c7 ) ); stbir__simdf_store( output7, o0 ); ) 9966 #endif 9967 9968 input += 4; 9969 stbIF0( output0 += 4; ) stbIF1( output1 += 4; ) stbIF2( output2 += 4; ) stbIF3( output3 += 4; ) stbIF4( output4 += 4; ) stbIF5( output5 += 4; ) stbIF6( output6 += 4; ) stbIF7( output7 += 4; ) 9970 } 9971 } 9972 #else 9973 STBIR_NO_UNROLL_LOOP_START 9974 while ( ( (char*)input_end - (char*) input ) >= 16 ) 9975 { 9976 float r0, r1, r2, r3; 9977 STBIR_NO_UNROLL(input); 9978 9979 r0 = input[0], r1 = input[1], r2 = input[2], r3 = input[3]; 9980 9981 #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE 9982 stbIF0( output0[0] += ( r0 * c0s ); output0[1] += ( r1 * c0s ); output0[2] += ( r2 * c0s ); output0[3] += ( r3 * c0s ); ) 9983 stbIF1( output1[0] += ( r0 * c1s ); output1[1] += ( r1 * c1s ); output1[2] += ( r2 * c1s ); output1[3] += ( r3 * c1s ); ) 9984 stbIF2( output2[0] += ( r0 * c2s ); output2[1] += ( r1 * c2s ); output2[2] += ( r2 * c2s ); output2[3] += ( r3 * c2s ); ) 9985 stbIF3( output3[0] += ( r0 * c3s ); output3[1] += ( r1 * c3s ); output3[2] += ( r2 * c3s ); output3[3] += ( r3 * c3s ); ) 9986 stbIF4( output4[0] += ( r0 * c4s ); output4[1] += ( r1 * c4s ); output4[2] += ( r2 * c4s ); output4[3] += ( r3 * c4s ); ) 9987 stbIF5( output5[0] += ( r0 * c5s ); output5[1] += ( r1 * c5s ); output5[2] += ( r2 * c5s ); output5[3] += ( r3 * c5s ); ) 9988 stbIF6( output6[0] += ( r0 * c6s ); output6[1] += ( r1 * c6s ); output6[2] += ( r2 * c6s ); output6[3] += ( r3 * c6s ); ) 9989 stbIF7( output7[0] += ( r0 * c7s ); output7[1] += ( r1 * c7s ); output7[2] += ( r2 * c7s ); output7[3] += ( r3 * c7s ); ) 9990 #else 9991 stbIF0( output0[0] = ( r0 * c0s ); output0[1] = ( r1 * c0s ); output0[2] = ( r2 * c0s ); output0[3] = ( r3 * c0s ); ) 9992 stbIF1( output1[0] = ( r0 * c1s ); output1[1] = ( r1 * c1s ); output1[2] = ( r2 * c1s ); output1[3] = ( r3 * c1s ); ) 9993 stbIF2( output2[0] = ( r0 * c2s ); output2[1] = ( r1 * c2s ); output2[2] = ( r2 * c2s ); output2[3] = ( r3 * c2s ); ) 9994 stbIF3( output3[0] = ( r0 * c3s ); output3[1] = ( r1 * c3s ); output3[2] = ( r2 * c3s ); output3[3] = ( r3 * c3s ); ) 9995 stbIF4( output4[0] = ( r0 * c4s ); output4[1] = ( r1 * c4s ); output4[2] = ( r2 * c4s ); output4[3] = ( r3 * c4s ); ) 9996 stbIF5( output5[0] = ( r0 * c5s ); output5[1] = ( r1 * c5s ); output5[2] = ( r2 * c5s ); output5[3] = ( r3 * c5s ); ) 9997 stbIF6( output6[0] = ( r0 * c6s ); output6[1] = ( r1 * c6s ); output6[2] = ( r2 * c6s ); output6[3] = ( r3 * c6s ); ) 9998 stbIF7( output7[0] = ( r0 * c7s ); output7[1] = ( r1 * c7s ); output7[2] = ( r2 * c7s ); output7[3] = ( r3 * c7s ); ) 9999 #endif 10000 10001 input += 4; 10002 stbIF0( output0 += 4; ) stbIF1( output1 += 4; ) stbIF2( output2 += 4; ) stbIF3( output3 += 4; ) stbIF4( output4 += 4; ) stbIF5( output5 += 4; ) stbIF6( output6 += 4; ) stbIF7( output7 += 4; ) 10003 } 10004 #endif 10005 STBIR_NO_UNROLL_LOOP_START 10006 while ( input < input_end ) 10007 { 10008 float r = input[0]; 10009 STBIR_NO_UNROLL(output0); 10010 10011 #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE 10012 stbIF0( output0[0] += ( r * c0s ); ) 10013 stbIF1( output1[0] += ( r * c1s ); ) 10014 stbIF2( output2[0] += ( r * c2s ); ) 10015 stbIF3( output3[0] += ( r * c3s ); ) 10016 stbIF4( output4[0] += ( r * c4s ); ) 10017 stbIF5( output5[0] += ( r * c5s ); ) 10018 stbIF6( output6[0] += ( r * c6s ); ) 10019 stbIF7( output7[0] += ( r * c7s ); ) 10020 #else 10021 stbIF0( output0[0] = ( r * c0s ); ) 10022 stbIF1( output1[0] = ( r * c1s ); ) 10023 stbIF2( output2[0] = ( r * c2s ); ) 10024 stbIF3( output3[0] = ( r * c3s ); ) 10025 stbIF4( output4[0] = ( r * c4s ); ) 10026 stbIF5( output5[0] = ( r * c5s ); ) 10027 stbIF6( output6[0] = ( r * c6s ); ) 10028 stbIF7( output7[0] = ( r * c7s ); ) 10029 #endif 10030 10031 ++input; 10032 stbIF0( ++output0; ) stbIF1( ++output1; ) stbIF2( ++output2; ) stbIF3( ++output3; ) stbIF4( ++output4; ) stbIF5( ++output5; ) stbIF6( ++output6; ) stbIF7( ++output7; ) 10033 } 10034 } 10035 10036 static void STBIR_chans( stbir__vertical_gather_with_,_coeffs)( float * outputp, float const * vertical_coefficients, float const ** inputs, float const * input0_end ) 10037 { 10038 float STBIR_SIMD_STREAMOUT_PTR( * ) output = outputp; 10039 10040 stbIF0( float const * input0 = inputs[0]; float c0s = vertical_coefficients[0]; ) 10041 stbIF1( float const * input1 = inputs[1]; float c1s = vertical_coefficients[1]; ) 10042 stbIF2( float const * input2 = inputs[2]; float c2s = vertical_coefficients[2]; ) 10043 stbIF3( float const * input3 = inputs[3]; float c3s = vertical_coefficients[3]; ) 10044 stbIF4( float const * input4 = inputs[4]; float c4s = vertical_coefficients[4]; ) 10045 stbIF5( float const * input5 = inputs[5]; float c5s = vertical_coefficients[5]; ) 10046 stbIF6( float const * input6 = inputs[6]; float c6s = vertical_coefficients[6]; ) 10047 stbIF7( float const * input7 = inputs[7]; float c7s = vertical_coefficients[7]; ) 10048 10049 #if ( STBIR__vertical_channels == 1 ) && !defined(STB_IMAGE_RESIZE_VERTICAL_CONTINUE) 10050 // check single channel one weight 10051 if ( ( c0s >= (1.0f-0.000001f) ) && ( c0s <= (1.0f+0.000001f) ) ) 10052 { 10053 STBIR_MEMCPY( output, input0, (char*)input0_end - (char*)input0 ); 10054 return; 10055 } 10056 #endif 10057 10058 #ifdef STBIR_SIMD 10059 { 10060 stbIF0(stbir__simdfX c0 = stbir__simdf_frepX( c0s ); ) 10061 stbIF1(stbir__simdfX c1 = stbir__simdf_frepX( c1s ); ) 10062 stbIF2(stbir__simdfX c2 = stbir__simdf_frepX( c2s ); ) 10063 stbIF3(stbir__simdfX c3 = stbir__simdf_frepX( c3s ); ) 10064 stbIF4(stbir__simdfX c4 = stbir__simdf_frepX( c4s ); ) 10065 stbIF5(stbir__simdfX c5 = stbir__simdf_frepX( c5s ); ) 10066 stbIF6(stbir__simdfX c6 = stbir__simdf_frepX( c6s ); ) 10067 stbIF7(stbir__simdfX c7 = stbir__simdf_frepX( c7s ); ) 10068 10069 STBIR_SIMD_NO_UNROLL_LOOP_START 10070 while ( ( (char*)input0_end - (char*) input0 ) >= (16*stbir__simdfX_float_count) ) 10071 { 10072 stbir__simdfX o0, o1, o2, o3, r0, r1, r2, r3; 10073 STBIR_SIMD_NO_UNROLL(output); 10074 10075 // prefetch four loop iterations ahead (doesn't affect much for small resizes, but helps with big ones) 10076 stbIF0( stbir__prefetch( input0 + (16*stbir__simdfX_float_count) ); ) 10077 stbIF1( stbir__prefetch( input1 + (16*stbir__simdfX_float_count) ); ) 10078 stbIF2( stbir__prefetch( input2 + (16*stbir__simdfX_float_count) ); ) 10079 stbIF3( stbir__prefetch( input3 + (16*stbir__simdfX_float_count) ); ) 10080 stbIF4( stbir__prefetch( input4 + (16*stbir__simdfX_float_count) ); ) 10081 stbIF5( stbir__prefetch( input5 + (16*stbir__simdfX_float_count) ); ) 10082 stbIF6( stbir__prefetch( input6 + (16*stbir__simdfX_float_count) ); ) 10083 stbIF7( stbir__prefetch( input7 + (16*stbir__simdfX_float_count) ); ) 10084 10085 #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE 10086 stbIF0( stbir__simdfX_load( o0, output ); stbir__simdfX_load( o1, output+stbir__simdfX_float_count ); stbir__simdfX_load( o2, output+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( o3, output+(3*stbir__simdfX_float_count) ); 10087 stbir__simdfX_load( r0, input0 ); stbir__simdfX_load( r1, input0+stbir__simdfX_float_count ); stbir__simdfX_load( r2, input0+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( r3, input0+(3*stbir__simdfX_float_count) ); 10088 stbir__simdfX_madd( o0, o0, r0, c0 ); stbir__simdfX_madd( o1, o1, r1, c0 ); stbir__simdfX_madd( o2, o2, r2, c0 ); stbir__simdfX_madd( o3, o3, r3, c0 ); ) 10089 #else 10090 stbIF0( stbir__simdfX_load( r0, input0 ); stbir__simdfX_load( r1, input0+stbir__simdfX_float_count ); stbir__simdfX_load( r2, input0+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( r3, input0+(3*stbir__simdfX_float_count) ); 10091 stbir__simdfX_mult( o0, r0, c0 ); stbir__simdfX_mult( o1, r1, c0 ); stbir__simdfX_mult( o2, r2, c0 ); stbir__simdfX_mult( o3, r3, c0 ); ) 10092 #endif 10093 10094 stbIF1( stbir__simdfX_load( r0, input1 ); stbir__simdfX_load( r1, input1+stbir__simdfX_float_count ); stbir__simdfX_load( r2, input1+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( r3, input1+(3*stbir__simdfX_float_count) ); 10095 stbir__simdfX_madd( o0, o0, r0, c1 ); stbir__simdfX_madd( o1, o1, r1, c1 ); stbir__simdfX_madd( o2, o2, r2, c1 ); stbir__simdfX_madd( o3, o3, r3, c1 ); ) 10096 stbIF2( stbir__simdfX_load( r0, input2 ); stbir__simdfX_load( r1, input2+stbir__simdfX_float_count ); stbir__simdfX_load( r2, input2+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( r3, input2+(3*stbir__simdfX_float_count) ); 10097 stbir__simdfX_madd( o0, o0, r0, c2 ); stbir__simdfX_madd( o1, o1, r1, c2 ); stbir__simdfX_madd( o2, o2, r2, c2 ); stbir__simdfX_madd( o3, o3, r3, c2 ); ) 10098 stbIF3( stbir__simdfX_load( r0, input3 ); stbir__simdfX_load( r1, input3+stbir__simdfX_float_count ); stbir__simdfX_load( r2, input3+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( r3, input3+(3*stbir__simdfX_float_count) ); 10099 stbir__simdfX_madd( o0, o0, r0, c3 ); stbir__simdfX_madd( o1, o1, r1, c3 ); stbir__simdfX_madd( o2, o2, r2, c3 ); stbir__simdfX_madd( o3, o3, r3, c3 ); ) 10100 stbIF4( stbir__simdfX_load( r0, input4 ); stbir__simdfX_load( r1, input4+stbir__simdfX_float_count ); stbir__simdfX_load( r2, input4+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( r3, input4+(3*stbir__simdfX_float_count) ); 10101 stbir__simdfX_madd( o0, o0, r0, c4 ); stbir__simdfX_madd( o1, o1, r1, c4 ); stbir__simdfX_madd( o2, o2, r2, c4 ); stbir__simdfX_madd( o3, o3, r3, c4 ); ) 10102 stbIF5( stbir__simdfX_load( r0, input5 ); stbir__simdfX_load( r1, input5+stbir__simdfX_float_count ); stbir__simdfX_load( r2, input5+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( r3, input5+(3*stbir__simdfX_float_count) ); 10103 stbir__simdfX_madd( o0, o0, r0, c5 ); stbir__simdfX_madd( o1, o1, r1, c5 ); stbir__simdfX_madd( o2, o2, r2, c5 ); stbir__simdfX_madd( o3, o3, r3, c5 ); ) 10104 stbIF6( stbir__simdfX_load( r0, input6 ); stbir__simdfX_load( r1, input6+stbir__simdfX_float_count ); stbir__simdfX_load( r2, input6+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( r3, input6+(3*stbir__simdfX_float_count) ); 10105 stbir__simdfX_madd( o0, o0, r0, c6 ); stbir__simdfX_madd( o1, o1, r1, c6 ); stbir__simdfX_madd( o2, o2, r2, c6 ); stbir__simdfX_madd( o3, o3, r3, c6 ); ) 10106 stbIF7( stbir__simdfX_load( r0, input7 ); stbir__simdfX_load( r1, input7+stbir__simdfX_float_count ); stbir__simdfX_load( r2, input7+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( r3, input7+(3*stbir__simdfX_float_count) ); 10107 stbir__simdfX_madd( o0, o0, r0, c7 ); stbir__simdfX_madd( o1, o1, r1, c7 ); stbir__simdfX_madd( o2, o2, r2, c7 ); stbir__simdfX_madd( o3, o3, r3, c7 ); ) 10108 10109 stbir__simdfX_store( output, o0 ); stbir__simdfX_store( output+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output+(3*stbir__simdfX_float_count), o3 ); 10110 output += (4*stbir__simdfX_float_count); 10111 stbIF0( input0 += (4*stbir__simdfX_float_count); ) stbIF1( input1 += (4*stbir__simdfX_float_count); ) stbIF2( input2 += (4*stbir__simdfX_float_count); ) stbIF3( input3 += (4*stbir__simdfX_float_count); ) stbIF4( input4 += (4*stbir__simdfX_float_count); ) stbIF5( input5 += (4*stbir__simdfX_float_count); ) stbIF6( input6 += (4*stbir__simdfX_float_count); ) stbIF7( input7 += (4*stbir__simdfX_float_count); ) 10112 } 10113 10114 STBIR_SIMD_NO_UNROLL_LOOP_START 10115 while ( ( (char*)input0_end - (char*) input0 ) >= 16 ) 10116 { 10117 stbir__simdf o0, r0; 10118 STBIR_SIMD_NO_UNROLL(output); 10119 10120 #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE 10121 stbIF0( stbir__simdf_load( o0, output ); stbir__simdf_load( r0, input0 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c0 ) ); ) 10122 #else 10123 stbIF0( stbir__simdf_load( r0, input0 ); stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c0 ) ); ) 10124 #endif 10125 stbIF1( stbir__simdf_load( r0, input1 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c1 ) ); ) 10126 stbIF2( stbir__simdf_load( r0, input2 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c2 ) ); ) 10127 stbIF3( stbir__simdf_load( r0, input3 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c3 ) ); ) 10128 stbIF4( stbir__simdf_load( r0, input4 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c4 ) ); ) 10129 stbIF5( stbir__simdf_load( r0, input5 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c5 ) ); ) 10130 stbIF6( stbir__simdf_load( r0, input6 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c6 ) ); ) 10131 stbIF7( stbir__simdf_load( r0, input7 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c7 ) ); ) 10132 10133 stbir__simdf_store( output, o0 ); 10134 output += 4; 10135 stbIF0( input0 += 4; ) stbIF1( input1 += 4; ) stbIF2( input2 += 4; ) stbIF3( input3 += 4; ) stbIF4( input4 += 4; ) stbIF5( input5 += 4; ) stbIF6( input6 += 4; ) stbIF7( input7 += 4; ) 10136 } 10137 } 10138 #else 10139 STBIR_NO_UNROLL_LOOP_START 10140 while ( ( (char*)input0_end - (char*) input0 ) >= 16 ) 10141 { 10142 float o0, o1, o2, o3; 10143 STBIR_NO_UNROLL(output); 10144 #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE 10145 stbIF0( o0 = output[0] + input0[0] * c0s; o1 = output[1] + input0[1] * c0s; o2 = output[2] + input0[2] * c0s; o3 = output[3] + input0[3] * c0s; ) 10146 #else 10147 stbIF0( o0 = input0[0] * c0s; o1 = input0[1] * c0s; o2 = input0[2] * c0s; o3 = input0[3] * c0s; ) 10148 #endif 10149 stbIF1( o0 += input1[0] * c1s; o1 += input1[1] * c1s; o2 += input1[2] * c1s; o3 += input1[3] * c1s; ) 10150 stbIF2( o0 += input2[0] * c2s; o1 += input2[1] * c2s; o2 += input2[2] * c2s; o3 += input2[3] * c2s; ) 10151 stbIF3( o0 += input3[0] * c3s; o1 += input3[1] * c3s; o2 += input3[2] * c3s; o3 += input3[3] * c3s; ) 10152 stbIF4( o0 += input4[0] * c4s; o1 += input4[1] * c4s; o2 += input4[2] * c4s; o3 += input4[3] * c4s; ) 10153 stbIF5( o0 += input5[0] * c5s; o1 += input5[1] * c5s; o2 += input5[2] * c5s; o3 += input5[3] * c5s; ) 10154 stbIF6( o0 += input6[0] * c6s; o1 += input6[1] * c6s; o2 += input6[2] * c6s; o3 += input6[3] * c6s; ) 10155 stbIF7( o0 += input7[0] * c7s; o1 += input7[1] * c7s; o2 += input7[2] * c7s; o3 += input7[3] * c7s; ) 10156 output[0] = o0; output[1] = o1; output[2] = o2; output[3] = o3; 10157 output += 4; 10158 stbIF0( input0 += 4; ) stbIF1( input1 += 4; ) stbIF2( input2 += 4; ) stbIF3( input3 += 4; ) stbIF4( input4 += 4; ) stbIF5( input5 += 4; ) stbIF6( input6 += 4; ) stbIF7( input7 += 4; ) 10159 } 10160 #endif 10161 STBIR_NO_UNROLL_LOOP_START 10162 while ( input0 < input0_end ) 10163 { 10164 float o0; 10165 STBIR_NO_UNROLL(output); 10166 #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE 10167 stbIF0( o0 = output[0] + input0[0] * c0s; ) 10168 #else 10169 stbIF0( o0 = input0[0] * c0s; ) 10170 #endif 10171 stbIF1( o0 += input1[0] * c1s; ) 10172 stbIF2( o0 += input2[0] * c2s; ) 10173 stbIF3( o0 += input3[0] * c3s; ) 10174 stbIF4( o0 += input4[0] * c4s; ) 10175 stbIF5( o0 += input5[0] * c5s; ) 10176 stbIF6( o0 += input6[0] * c6s; ) 10177 stbIF7( o0 += input7[0] * c7s; ) 10178 output[0] = o0; 10179 ++output; 10180 stbIF0( ++input0; ) stbIF1( ++input1; ) stbIF2( ++input2; ) stbIF3( ++input3; ) stbIF4( ++input4; ) stbIF5( ++input5; ) stbIF6( ++input6; ) stbIF7( ++input7; ) 10181 } 10182 } 10183 10184 #undef stbIF0 10185 #undef stbIF1 10186 #undef stbIF2 10187 #undef stbIF3 10188 #undef stbIF4 10189 #undef stbIF5 10190 #undef stbIF6 10191 #undef stbIF7 10192 #undef STB_IMAGE_RESIZE_DO_VERTICALS 10193 #undef STBIR__vertical_channels 10194 #undef STB_IMAGE_RESIZE_DO_HORIZONTALS 10195 #undef STBIR_strs_join24 10196 #undef STBIR_strs_join14 10197 #undef STBIR_chans 10198 #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE 10199 #undef STB_IMAGE_RESIZE_VERTICAL_CONTINUE 10200 #endif 10201 10202 #else // !STB_IMAGE_RESIZE_DO_VERTICALS 10203 10204 #define STBIR_chans( start, end ) STBIR_strs_join1(start,STBIR__horizontal_channels,end) 10205 10206 #ifndef stbir__2_coeff_only 10207 #define stbir__2_coeff_only() \ 10208 stbir__1_coeff_only(); \ 10209 stbir__1_coeff_remnant(1); 10210 #endif 10211 10212 #ifndef stbir__2_coeff_remnant 10213 #define stbir__2_coeff_remnant( ofs ) \ 10214 stbir__1_coeff_remnant(ofs); \ 10215 stbir__1_coeff_remnant((ofs)+1); 10216 #endif 10217 10218 #ifndef stbir__3_coeff_only 10219 #define stbir__3_coeff_only() \ 10220 stbir__2_coeff_only(); \ 10221 stbir__1_coeff_remnant(2); 10222 #endif 10223 10224 #ifndef stbir__3_coeff_remnant 10225 #define stbir__3_coeff_remnant( ofs ) \ 10226 stbir__2_coeff_remnant(ofs); \ 10227 stbir__1_coeff_remnant((ofs)+2); 10228 #endif 10229 10230 #ifndef stbir__3_coeff_setup 10231 #define stbir__3_coeff_setup() 10232 #endif 10233 10234 #ifndef stbir__4_coeff_start 10235 #define stbir__4_coeff_start() \ 10236 stbir__2_coeff_only(); \ 10237 stbir__2_coeff_remnant(2); 10238 #endif 10239 10240 #ifndef stbir__4_coeff_continue_from_4 10241 #define stbir__4_coeff_continue_from_4( ofs ) \ 10242 stbir__2_coeff_remnant(ofs); \ 10243 stbir__2_coeff_remnant((ofs)+2); 10244 #endif 10245 10246 #ifndef stbir__store_output_tiny 10247 #define stbir__store_output_tiny stbir__store_output 10248 #endif 10249 10250 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_1_coeff)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width ) 10251 { 10252 float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; 10253 float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; 10254 STBIR_SIMD_NO_UNROLL_LOOP_START 10255 do { 10256 float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 10257 float const * hc = horizontal_coefficients; 10258 stbir__1_coeff_only(); 10259 stbir__store_output_tiny(); 10260 } while ( output < output_end ); 10261 } 10262 10263 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_2_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width ) 10264 { 10265 float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; 10266 float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; 10267 STBIR_SIMD_NO_UNROLL_LOOP_START 10268 do { 10269 float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 10270 float const * hc = horizontal_coefficients; 10271 stbir__2_coeff_only(); 10272 stbir__store_output_tiny(); 10273 } while ( output < output_end ); 10274 } 10275 10276 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_3_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width ) 10277 { 10278 float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; 10279 float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; 10280 STBIR_SIMD_NO_UNROLL_LOOP_START 10281 do { 10282 float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 10283 float const * hc = horizontal_coefficients; 10284 stbir__3_coeff_only(); 10285 stbir__store_output_tiny(); 10286 } while ( output < output_end ); 10287 } 10288 10289 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_4_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width ) 10290 { 10291 float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; 10292 float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; 10293 STBIR_SIMD_NO_UNROLL_LOOP_START 10294 do { 10295 float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 10296 float const * hc = horizontal_coefficients; 10297 stbir__4_coeff_start(); 10298 stbir__store_output(); 10299 } while ( output < output_end ); 10300 } 10301 10302 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_5_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width ) 10303 { 10304 float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; 10305 float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; 10306 STBIR_SIMD_NO_UNROLL_LOOP_START 10307 do { 10308 float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 10309 float const * hc = horizontal_coefficients; 10310 stbir__4_coeff_start(); 10311 stbir__1_coeff_remnant(4); 10312 stbir__store_output(); 10313 } while ( output < output_end ); 10314 } 10315 10316 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_6_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width ) 10317 { 10318 float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; 10319 float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; 10320 STBIR_SIMD_NO_UNROLL_LOOP_START 10321 do { 10322 float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 10323 float const * hc = horizontal_coefficients; 10324 stbir__4_coeff_start(); 10325 stbir__2_coeff_remnant(4); 10326 stbir__store_output(); 10327 } while ( output < output_end ); 10328 } 10329 10330 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_7_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width ) 10331 { 10332 float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; 10333 float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; 10334 stbir__3_coeff_setup(); 10335 STBIR_SIMD_NO_UNROLL_LOOP_START 10336 do { 10337 float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 10338 float const * hc = horizontal_coefficients; 10339 10340 stbir__4_coeff_start(); 10341 stbir__3_coeff_remnant(4); 10342 stbir__store_output(); 10343 } while ( output < output_end ); 10344 } 10345 10346 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_8_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width ) 10347 { 10348 float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; 10349 float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; 10350 STBIR_SIMD_NO_UNROLL_LOOP_START 10351 do { 10352 float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 10353 float const * hc = horizontal_coefficients; 10354 stbir__4_coeff_start(); 10355 stbir__4_coeff_continue_from_4(4); 10356 stbir__store_output(); 10357 } while ( output < output_end ); 10358 } 10359 10360 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_9_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width ) 10361 { 10362 float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; 10363 float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; 10364 STBIR_SIMD_NO_UNROLL_LOOP_START 10365 do { 10366 float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 10367 float const * hc = horizontal_coefficients; 10368 stbir__4_coeff_start(); 10369 stbir__4_coeff_continue_from_4(4); 10370 stbir__1_coeff_remnant(8); 10371 stbir__store_output(); 10372 } while ( output < output_end ); 10373 } 10374 10375 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_10_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width ) 10376 { 10377 float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; 10378 float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; 10379 STBIR_SIMD_NO_UNROLL_LOOP_START 10380 do { 10381 float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 10382 float const * hc = horizontal_coefficients; 10383 stbir__4_coeff_start(); 10384 stbir__4_coeff_continue_from_4(4); 10385 stbir__2_coeff_remnant(8); 10386 stbir__store_output(); 10387 } while ( output < output_end ); 10388 } 10389 10390 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_11_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width ) 10391 { 10392 float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; 10393 float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; 10394 stbir__3_coeff_setup(); 10395 STBIR_SIMD_NO_UNROLL_LOOP_START 10396 do { 10397 float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 10398 float const * hc = horizontal_coefficients; 10399 stbir__4_coeff_start(); 10400 stbir__4_coeff_continue_from_4(4); 10401 stbir__3_coeff_remnant(8); 10402 stbir__store_output(); 10403 } while ( output < output_end ); 10404 } 10405 10406 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_12_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width ) 10407 { 10408 float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; 10409 float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; 10410 STBIR_SIMD_NO_UNROLL_LOOP_START 10411 do { 10412 float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 10413 float const * hc = horizontal_coefficients; 10414 stbir__4_coeff_start(); 10415 stbir__4_coeff_continue_from_4(4); 10416 stbir__4_coeff_continue_from_4(8); 10417 stbir__store_output(); 10418 } while ( output < output_end ); 10419 } 10420 10421 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_n_coeffs_mod0 )( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width ) 10422 { 10423 float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; 10424 float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; 10425 STBIR_SIMD_NO_UNROLL_LOOP_START 10426 do { 10427 float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 10428 int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 4 + 3 ) >> 2; 10429 float const * hc = horizontal_coefficients; 10430 10431 stbir__4_coeff_start(); 10432 STBIR_SIMD_NO_UNROLL_LOOP_START 10433 do { 10434 hc += 4; 10435 decode += STBIR__horizontal_channels * 4; 10436 stbir__4_coeff_continue_from_4( 0 ); 10437 --n; 10438 } while ( n > 0 ); 10439 stbir__store_output(); 10440 } while ( output < output_end ); 10441 } 10442 10443 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_n_coeffs_mod1 )( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width ) 10444 { 10445 float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; 10446 float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; 10447 STBIR_SIMD_NO_UNROLL_LOOP_START 10448 do { 10449 float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 10450 int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 5 + 3 ) >> 2; 10451 float const * hc = horizontal_coefficients; 10452 10453 stbir__4_coeff_start(); 10454 STBIR_SIMD_NO_UNROLL_LOOP_START 10455 do { 10456 hc += 4; 10457 decode += STBIR__horizontal_channels * 4; 10458 stbir__4_coeff_continue_from_4( 0 ); 10459 --n; 10460 } while ( n > 0 ); 10461 stbir__1_coeff_remnant( 4 ); 10462 stbir__store_output(); 10463 } while ( output < output_end ); 10464 } 10465 10466 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_n_coeffs_mod2 )( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width ) 10467 { 10468 float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; 10469 float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; 10470 STBIR_SIMD_NO_UNROLL_LOOP_START 10471 do { 10472 float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 10473 int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 6 + 3 ) >> 2; 10474 float const * hc = horizontal_coefficients; 10475 10476 stbir__4_coeff_start(); 10477 STBIR_SIMD_NO_UNROLL_LOOP_START 10478 do { 10479 hc += 4; 10480 decode += STBIR__horizontal_channels * 4; 10481 stbir__4_coeff_continue_from_4( 0 ); 10482 --n; 10483 } while ( n > 0 ); 10484 stbir__2_coeff_remnant( 4 ); 10485 10486 stbir__store_output(); 10487 } while ( output < output_end ); 10488 } 10489 10490 static void STBIR_chans( stbir__horizontal_gather_,_channels_with_n_coeffs_mod3 )( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width ) 10491 { 10492 float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; 10493 float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; 10494 stbir__3_coeff_setup(); 10495 STBIR_SIMD_NO_UNROLL_LOOP_START 10496 do { 10497 float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 10498 int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 7 + 3 ) >> 2; 10499 float const * hc = horizontal_coefficients; 10500 10501 stbir__4_coeff_start(); 10502 STBIR_SIMD_NO_UNROLL_LOOP_START 10503 do { 10504 hc += 4; 10505 decode += STBIR__horizontal_channels * 4; 10506 stbir__4_coeff_continue_from_4( 0 ); 10507 --n; 10508 } while ( n > 0 ); 10509 stbir__3_coeff_remnant( 4 ); 10510 10511 stbir__store_output(); 10512 } while ( output < output_end ); 10513 } 10514 10515 static stbir__horizontal_gather_channels_func * STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_funcs)[4]= 10516 { 10517 STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod0), 10518 STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod1), 10519 STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod2), 10520 STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod3), 10521 }; 10522 10523 static stbir__horizontal_gather_channels_func * STBIR_chans(stbir__horizontal_gather_,_channels_funcs)[12]= 10524 { 10525 STBIR_chans(stbir__horizontal_gather_,_channels_with_1_coeff), 10526 STBIR_chans(stbir__horizontal_gather_,_channels_with_2_coeffs), 10527 STBIR_chans(stbir__horizontal_gather_,_channels_with_3_coeffs), 10528 STBIR_chans(stbir__horizontal_gather_,_channels_with_4_coeffs), 10529 STBIR_chans(stbir__horizontal_gather_,_channels_with_5_coeffs), 10530 STBIR_chans(stbir__horizontal_gather_,_channels_with_6_coeffs), 10531 STBIR_chans(stbir__horizontal_gather_,_channels_with_7_coeffs), 10532 STBIR_chans(stbir__horizontal_gather_,_channels_with_8_coeffs), 10533 STBIR_chans(stbir__horizontal_gather_,_channels_with_9_coeffs), 10534 STBIR_chans(stbir__horizontal_gather_,_channels_with_10_coeffs), 10535 STBIR_chans(stbir__horizontal_gather_,_channels_with_11_coeffs), 10536 STBIR_chans(stbir__horizontal_gather_,_channels_with_12_coeffs), 10537 }; 10538 10539 #undef STBIR__horizontal_channels 10540 #undef STB_IMAGE_RESIZE_DO_HORIZONTALS 10541 #undef stbir__1_coeff_only 10542 #undef stbir__1_coeff_remnant 10543 #undef stbir__2_coeff_only 10544 #undef stbir__2_coeff_remnant 10545 #undef stbir__3_coeff_only 10546 #undef stbir__3_coeff_remnant 10547 #undef stbir__3_coeff_setup 10548 #undef stbir__4_coeff_start 10549 #undef stbir__4_coeff_continue_from_4 10550 #undef stbir__store_output 10551 #undef stbir__store_output_tiny 10552 #undef STBIR_chans 10553 10554 #endif // HORIZONALS 10555 10556 #undef STBIR_strs_join2 10557 #undef STBIR_strs_join1 10558 10559 #endif // STB_IMAGE_RESIZE_DO_HORIZONTALS/VERTICALS/CODERS 10560 10561 /* 10562 ------------------------------------------------------------------------------ 10563 This software is available under 2 licenses -- choose whichever you prefer. 10564 ------------------------------------------------------------------------------ 10565 ALTERNATIVE A - MIT License 10566 Copyright (c) 2017 Sean Barrett 10567 Permission is hereby granted, free of charge, to any person obtaining a copy of 10568 this software and associated documentation files (the "Software"), to deal in 10569 the Software without restriction, including without limitation the rights to 10570 use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies 10571 of the Software, and to permit persons to whom the Software is furnished to do 10572 so, subject to the following conditions: 10573 The above copyright notice and this permission notice shall be included in all 10574 copies or substantial portions of the Software. 10575 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10576 IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 10577 FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 10578 AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 10579 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 10580 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 10581 SOFTWARE. 10582 ------------------------------------------------------------------------------ 10583 ALTERNATIVE B - Public Domain (www.unlicense.org) 10584 This is free and unencumbered software released into the public domain. 10585 Anyone is free to copy, modify, publish, use, compile, sell, or distribute this 10586 software, either in source code form or as a compiled binary, for any purpose, 10587 commercial or non-commercial, and by any means. 10588 In jurisdictions that recognize copyright laws, the author or authors of this 10589 software dedicate any and all copyright interest in the software to the public 10590 domain. We make this dedication for the benefit of the public at large and to 10591 the detriment of our heirs and successors. We intend this dedication to be an 10592 overt act of relinquishment in perpetuity of all present and future rights to 10593 this software under copyright law. 10594 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 10595 IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 10596 FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 10597 AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN 10598 ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION 10599 WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 10600 ------------------------------------------------------------------------------ 10601 */